dependency resolution is performed. If rpm discovers a missing dependency, rpm will
exit with an error. 
Removing a Package


Packages can be uninstalled using either the high-level or low-level tools. 
Table 14-6 lists the high-level tools. 


Table 14-6: Package Removal Commands 


Style Command(s) 
Debian apt-get remove package_name 
Red Hat yum erase package_name 


For example, to uninstall the emacs package from a Debian-style system, 
we can use this command: 


apt-get remove emacs 


Updating Packages from a Repository 


The most common package management task is keeping the system up-to- 
date with the latest versions of packages. The high-level tools can perform 
this vital task in a single step (see Table 14-7). 


Table 14-7: Package Update Commands 


Style Command(s) 
Debian apt-get update; apt-get upgrade 
Red Hat yum update 


For example, to apply all available updates to the installed packages on 
a Debian-style system, we can use this command: 


apt-get update; apt-get upgrade 


Upgrading a Package from a Package File 


If an updated version of a package has been downloaded from a non- 
repository source, it can be installed, replacing the previous version (see 
Table 14-8). 


Table 14-8: Low-Level Package Upgrade Commands 


Style Command(s) 
Debian dpkg -i package_file 
Red Hat rpm -U package_file 


For example, to update an existing installation of emacs to the version 
contained in the package file emacs-22.1-7.fc7-i386.rpm on a Red Hat system, 
we can use this command: 


rpm -U emacs-22.1-7.fc7-i386.rpm 


dpkg does not have a specific option for upgrading a package versus installing one as 
rpm does. 


Listing Installed Packages 


Table 14-9 lists the commands we can use to display a list of all the packages 
installed on the system. 


Table 14-9: Package Listing Commands 


Style Command(s) 
Debian dpkg -l
Red Hat rpm -qa 


Determining Whether a Package Is Installed 


Table 14-10 lists the low-level tools we can use to display whether a specified 
package is installed. 


Table 14-10: Package Status Commands 


Style Command(s) 
Debian dpkg -s package_name 
Red Hat rpm -q package_name 


For example, to determine whether the emacs package is installed on a 
Debian-style system, we can use this: 


dpkg -s emacs 
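
The two status commands in Table 14-10 can be wrapped in one small helper. The following is a minimal sketch, not part of either tool: the function name is_installed is made up for illustration, and it assumes that dpkg reports a fully installed package with a Status line ending in "ok installed" and that rpm -q exits with a non-zero status for a package that is not installed.

```shell
#!/bin/sh
# Hypothetical helper: report whether a package is installed, using
# whichever low-level tool is present on the system.
# Returns 0 if installed, 1 if not, 2 if neither dpkg nor rpm is found.
is_installed() {
    pkg=$1
    if command -v dpkg >/dev/null 2>&1; then
        # dpkg -s prints a "Status:" line; match the fully installed state.
        dpkg -s "$pkg" 2>/dev/null | grep -q '^Status: .* ok installed$'
    elif command -v rpm >/dev/null 2>&1; then
        # rpm -q exits non-zero when the package is not installed.
        rpm -q "$pkg" >/dev/null 2>&1
    else
        return 2
    fi
}

if is_installed emacs; then
    echo "emacs is installed"
else
    echo "emacs is not installed (or no supported tool was found)"
fi
```

Checking the tool with command -v first keeps the same script usable on both Debian-style and Red Hat-style systems.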


Displaying Information About an Installed Package 


If the name of an installed package is known, we can use the commands in 
Table 14-11 to display a description of the package. 


Table 14-11: Package Information Commands 


Style Command(s) 
Debian apt-cache show package_name 
Red Hat yum info package_name 

For example, to see a description of the emacs package on a Debian-style
system, we can use the following: 


apt-cache show emacs 


Finding Which Package Installed a File 


To determine what package is responsible for the installation of a particular 
file, we can use the commands in Table 14-12. 


Table 14-12: Package File Identification Commands 


Style Command(s) 
Debian dpkg -S file_name
Red Hat rpm -qf file_name


To see what package installed the /usr/bin/vim file on a Red Hat system, 
we can use the following: 


rpm -qf /usr/bin/vim 


Summing Up 


In the chapters that follow, we will explore many different programs cover- 
ing a wide range of application areas. While most of these programs are 
commonly installed by default, we may need to install additional packages 
if the necessary programs are not already installed on our system. With 
our newfound knowledge (and appreciation) of package management, we 
should have no problem installing and managing the programs we need. 


THE LINUX SOFTWARE INSTALLATION MYTH 


People migrating from other platforms sometimes fall victim to the myth that 
software is somehow difficult to install under Linux and that the variety of 
packaging schemes used by different distributions is a hindrance. Well, it is 
a hindrance, but only to proprietary software vendors that want to distribute 
binary-only versions of their secret software. 

The Linux software ecosystem is based on the idea of open source code.
If a program developer releases source code for a program, it is likely that a
person associated with a distribution will package the program and include it
in their repository. This method ensures that the program is well integrated into
the distribution, and the user is given the convenience of “one-stop shopping”
for software, rather than having to search for each program’s website. Recently,
major proprietary platform vendors have begun building application stores that
mimic this idea.

Device drivers are handled in much the same way, except that instead of 
being separate items in a distribution’s repository, they become part of the Linux 
kernel. Generally speaking, there is no such thing as a “driver disk” in Linux. 
Either the kernel supports a device or it doesn’t, and the Linux kernel supports 
a lot of devices—many more, in fact, than Windows does. Of course, this is 
of no consolation if the particular device you need is not supported. When 
that happens, you need to look at the cause. A lack of driver support is usually 
caused by one of three things. 


• The device is too new. Since many hardware vendors don’t actively support
  Linux development, it falls upon a member of the Linux community to write
  the kernel driver code. This takes time.

• The device is too exotic. Not all distributions include every possible device
  driver. Each distribution builds its own kernels, and since kernels are
  configurable (which is what makes it possible to run Linux on everything from
  wristwatches to mainframes), they may have overlooked a particular device.
  By locating and downloading the source code for the driver, it is possible
  for you (yes, you) to compile and install the driver yourself. This process is
  not overly difficult, but it is rather involved. We'll talk about compiling
  software in a later chapter.

• The hardware vendor is hiding something. It has neither released source
  code for a Linux driver nor has it released the technical documentation
  for somebody to create one for them. This means the hardware vendor is
  trying to keep the programming interfaces to the device a secret. Because
  we don’t want secret devices in our computers, it is best that you avoid
  such products.


STORAGE MEDIA


In previous chapters, we looked at manipulating data at the file level. In
this chapter, we will consider data at the device level. Linux has amazing
capabilities for handling storage devices, whether physical storage such as
hard disks, network storage, or virtual storage devices such as RAID
(Redundant Array of Independent Disks) and LVM (Logical Volume Manager).


However, because this is not a book about system administration, we will 
not try to cover this entire topic in depth. What we will try to do is introduce 
some of the concepts and key commands that are used to manage storage 
devices. 

To carry out the exercises in this chapter, we will use a USB flash drive 
and a CD-RW disc (for systems equipped with a CD-ROM burner). 

We will look at the following commands:


mount Mount a file system 

umount Unmount a file system 

fsck Check and repair a file system 

fdisk Manipulate disk partition table 

mkfs Create a file system 

dd Convert and copy a file 

genisoimage (mkisofs) Create an ISO 9660 image file 
wodim (cdrecord) Write data to optical storage media 


md5sum Calculate an MD5 checksum 


Mounting and Unmounting Storage Devices 


Recent advances in the Linux desktop have made storage device manage-
ment extremely easy for desktop users. For the most part, we attach a device 
to our system and it “just works.” In the old days (say, 2004), this stuff had to 
be done manually. On non-desktop systems (i.e., servers) this is still a largely 
manual procedure since servers often have extreme storage needs and com- 
plex configuration requirements. 

The first step in managing a storage device is attaching the device 
to the file system tree. This process, called mounting, allows the device to 
interact with the operating system. As we recall from Chapter 2, Unix- 
like operating systems (like Linux) maintain a single file system tree with 
devices attached at various points. This contrasts with other operating 
systems such as Windows that maintain separate file system trees for each 
device (for example C:\, D:\, etc.). 

A file named /etc/fstab (short for “file system table”) lists the devices 
(typically hard disk partitions) that are to be mounted at boot time. Here 
is an example /etc/fstab file from an early Fedora system: 


LABEL=/12        /          ext4    defaults        1 1
LABEL=/home      /home      ext4    defaults        1 2
LABEL=/boot      /boot      ext4    defaults        1 2
tmpfs            /dev/shm   tmpfs   defaults        0 0
devpts           /dev/pts   devpts  gid=5,mode=620  0 0
sysfs            /sys       sysfs   defaults        0 0
proc             /proc      proc    defaults        0 0
LABEL=SWAP-sda3  swap       swap    defaults        0 0


Most of the file systems listed in this example file are virtual and not 
applicable to our discussion. For our purposes, the interesting ones are the 
first three. 


LABEL=/12        /          ext4    defaults        1 1
LABEL=/home      /home      ext4    defaults        1 2
LABEL=/boot      /boot      ext4    defaults        1 2


These are the hard disk partitions. Each line of the file consists of six 
fields, as described in Table 15-1. 


Table 15-1: /etc/fstab Fields 


Field  Contents          Description

1      Device            Traditionally, this field contains the actual name of a
                         device file associated with the physical device, such as
                         /dev/sda1 (the first partition of the first detected hard
                         disk). But with today’s computers, which have many devices
                         that are hot pluggable (like USB drives), many modern
                         Linux distributions associate a device with a text label
                         instead. This label (which is added to the storage media
                         when it is formatted) can be either a simple text label or
                         a randomly generated UUID (Universally Unique Identifier).
                         This label is read by the operating system when the device
                         is attached to the system. That way, no matter which
                         device file is assigned to the actual physical device, it
                         can still be correctly identified.

2      Mount point       The directory where the device is attached to the file
                         system tree.

3      File system type  Linux allows many file system types to be mounted. Most
                         native Linux file systems are Fourth Extended File System
                         (ext4), but many others are supported, such as FAT16
                         (msdos), FAT32 (vfat), NTFS (ntfs), CD-ROM (iso9660), etc.

4      Options           File systems can be mounted with various options. It is
                         possible, for example, to mount file systems as read-only
                         or to prevent any programs from being executed from them
                         (a useful security feature for removable media).

5      Frequency         A single number that specifies if and when a file system
                         is to be backed up with the dump command.

6      Order             A single number that specifies in what order file systems
                         should be checked with the fsck command.
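
To see the field layout in action, the entries can be split with awk. The sketch below is illustrative only: the function name real_filesystems is made up, and the rule it applies (keep entries whose first field starts with LABEL=, UUID=, or /dev/ and whose type is not swap) is an assumption about what counts as a disk-backed file system, not something the fstab format itself defines.

```shell
#!/bin/sh
# Hypothetical helper: print "mount_point type" for fstab entries that
# appear to be backed by real storage. Reads fstab-format text on stdin.
real_filesystems() {
    awk '
        /^[ \t]*(#|$)/ { next }        # skip comments and blank lines
        $1 ~ /^(LABEL=|UUID=|\/dev\/)/ && $3 != "swap" {
            print $2, $3               # field 2: mount point, field 3: type
        }
    '
}

# Sample input: the example file shown earlier in this section.
real_filesystems <<'EOF'
LABEL=/12        /          ext4    defaults        1 1
LABEL=/home      /home      ext4    defaults        1 2
LABEL=/boot      /boot      ext4    defaults        1 2
tmpfs            /dev/shm   tmpfs   defaults        0 0
devpts           /dev/pts   devpts  gid=5,mode=620  0 0
LABEL=SWAP-sda3  swap       swap    defaults        0 0
EOF
# prints:
# / ext4
# /home ext4
# /boot ext4
```

On a real system the same function could read the live file with real_filesystems < /etc/fstab.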


Viewing a List of Mounted File Systems 


The mount command is used to mount file systems. Entering the command 
without arguments will display a list of the file systems currently mounted. 


[me@linuxbox ~]$ mount
/dev/sda2 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda5 on /home type ext4 (rw)
/dev/sda1 on /boot type ext4 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
/dev/sdd1 on /media/disk type vfat (rw,nosuid,nodev,noatime,uhelper=hal,uid=500,utf8,shortname=lower)
twin4:/musicbox on /misc/musicbox type nfs4 (rw,addr=192.168.1.4)


The format of the listing is as follows: device on mount_point type 
filesystem_type (options). For example, the first line shows that device /dev/ 
sda2 is mounted as the root file system, is of type ext4, and is both readable 
and writable (the option rw). This listing also has two interesting entries at 
the bottom of the list. The next-to-last entry shows a 2GB SD memory card 
in a card reader mounted at /media/disk, and the last entry is a network 
drive mounted at /misc/musicbox. 
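
Because every line follows the same device on mount_point type filesystem_type (options) pattern, the listing is easy to take apart with sed. This is a rough sketch under two assumptions: the function name parse_mount is invented for illustration, and none of the paths contain spaces.

```shell
#!/bin/sh
# Hypothetical helper: split mount-style lines into labeled fields.
# Reads lines of the form "device on mount_point type fs_type (options)".
parse_mount() {
    sed -n 's/^\(.*\) on \(.*\) type \([^ ]*\) (\(.*\))$/device=\1 mount_point=\2 type=\3 options=\4/p'
}

parse_mount <<'EOF'
/dev/sda2 on / type ext4 (rw)
twin4:/musicbox on /misc/musicbox type nfs4 (rw,addr=192.168.1.4)
EOF
# prints:
# device=/dev/sda2 mount_point=/ type=ext4 options=rw
# device=twin4:/musicbox mount_point=/misc/musicbox type=nfs4 options=rw,addr=192.168.1.4
```

Piping the live listing through it (mount | parse_mount) gives the same breakdown for the current system.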

For our first experiment, we will work with a CD-ROM. First, let’s look 
at a system before a CD-ROM is inserted. 


[me@linuxbox ~]$ mount
/dev/mapper/VolGroup00-LogVol00 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext4 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)


This listing is from a CentOS system, which is using LVM (Logical 
Volume Manager) to create its root file system. Like many modern Linux 
distributions, this system will attempt to automatically mount the CD-ROM 
after insertion. After we insert the disc, we see the following: 


[me@linuxbox ~]$ mount
/dev/mapper/VolGroup00-LogVol00 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext4 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sdc on /media/live-1.0.10-8 type iso9660 (ro,noexec,nosuid,nodev,uid=500)


The listing is the same as before, with one
additional entry. At the end of the listing we see that the CD-ROM (which
is device /dev/sdc on this system) has been mounted on /media/live-1.0.10-8 
and is type iso9660 (a CD-ROM). For the purposes of our experiment, 
we’re interested in the name of the device. When you conduct this experi- 
ment yourself, the device name will most likely be different. 


In the examples that follow, it is vitally important that you pay close attention to the 
actual device names in use on your system and do not use the names used in this 
text! Also note that audio CDs are not the same as CD-ROMs. Audio CDs do not 
contain file systems and thus cannot be mounted in the usual sense. 


Now that we have the device name of the CD-ROM drive, let’s unmount 
the disc and remount it at another location in the file system tree. To do this, 
we become the superuser (using the command appropriate for our system) 
and unmount the disc with the umount (notice the spelling) command. 


[me@linuxbox ~]$ su - 
Password: 
[root@linuxbox ~]# umount /dev/sdc 


The next step is to create a new mount point for the disc. A mount point is
simply a directory somewhere on the file system tree. There’s nothing special 
about it. It doesn’t even have to be an empty directory, though if you mount 
a device on a non-empty directory, you will not be able to see the directory’s 
previous contents until you unmount the device. For our purposes, we will 
create a new directory. 


[root@linuxbox ~]# mkdir /mnt/cdrom 


Finally, we mount the CD-ROM at the new mount point. The -t option 
is used to specify the file system type. 


[root@linuxbox ~]# mount -t iso9660 /dev/sdc /mnt/cdrom 


Afterward, we can examine the contents of the CD-ROM via the new 
mount point. 


[root@linuxbox ~]# cd /mnt/cdrom 
[root@linuxbox cdrom]# ls


Notice what happens when we try to unmount the CD-ROM. 


[root@linuxbox cdrom]# umount /dev/sdc 
umount: /mnt/cdrom: device is busy 


Why is this? The reason is that we cannot unmount a device if the device 
is being used by someone or some process. In this case, we changed our work- 
ing directory to the mount point for the CD-ROM, which causes the device to 
be busy. We can easily remedy the issue by changing the working directory to 
something other than the mount point. 


[root@linuxbox cdrom]# cd 
[root@linuxbox ~]# umount /dev/sdc 


Now the device unmounts successfully. 
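
The working-directory cause of a busy device can be tested for before calling umount. The following is a minimal sketch; the function name cwd_is_under is made up for illustration, and it only checks the current shell, not other users or processes that might also be using the device.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the current working directory is at or
# below the given mount point, one reason umount reports a busy device.
cwd_is_under() {
    mount_point=${1%/}              # drop a trailing slash ("/" becomes "")
    case "$(pwd -P)/" in
        "$mount_point"/*) return 0 ;;
        *)                return 1 ;;
    esac
}

if cwd_is_under /mnt/cdrom; then
    echo "Leave /mnt/cdrom (for example with cd) before unmounting."
fi
```

Where they are installed, tools such as fuser (with its -m option) or lsof can show which other processes are holding a mounted file system open.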


WHY UNMOUNTING IS IMPORTANT


If you look at the output of the free command, which displays statistics about 
memory usage, you will see a statistic called buffers. Computer systems 
are designed to go as fast as possible. One of the impediments to system 
speed is slow devices. Printers are a good example. Even the fastest printer 
is extremely slow by computer standards. A computer would be very slow 
indeed if it had to stop and wait for a printer to finish printing a page. In the 
early days of PCs (before multitasking), this was a real problem. If you were 
working on a spreadsheet or text document, the computer would stop and 
become unavailable every time you printed. The computer would send the 
data to the printer as fast as the printer could accept it, but it was very slow 
because printers don’t print very fast. This problem was solved by the advent 
of the printer buffer, a device containing some RAM memory that would sit 
between the computer and the printer. With the printer buffer in place, the 
computer would send the printer output to the buffer, and it would quickly be 
stored in the fast RAM so the computer could go back to work without wait- 
ing. Meanwhile, the printer buffer would slowly spool the data to the printer 
from the buffer’s memory at the speed at which the printer could accept it. 
This idea of buffering is used extensively in computers to make them faster. 
Don't let the need to occasionally read or write data to or from slow devices 
impede the speed of the system. Operating systems store data that has been 
read from and is to be written to storage devices in memory for as long as 
possible before actually having to interact with the slower device. On a Linux 
system, for example, you will notice that the system seems to fill up memory the 
longer it is used. This does not mean Linux is “using” all the memory; it means
that Linux is taking advantage of all the available memory to do as much
buffering as it can.

This buffering allows writing to storage devices to be done very quickly 
because writing to the physical device is being deferred to a future time. In the 
meantime, the data destined for the device is piling up in memory. From time to 
time, the operating system will write this data to the physical device. 

Unmounting a device entails writing all the remaining data to the device 
so that it can be safely removed. If the device is removed without unmounting it 
first, the possibility exists that not all the data destined for the device has been 
transferred. In some cases, this data may include vital directory updates, which 
will lead to file system corruption, one of the worst things that can happen on a 
computer. 
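
The buffering described above can be observed from the shell. This is a small sketch that assumes a Linux system, where the kernel exposes its memory statistics in /proc/meminfo with lines labeled Buffers: and Cached:; the function name show_buffering is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the Buffers and Cached figures from
# meminfo-format text supplied on standard input.
show_buffering() {
    awk '$1 == "Buffers:" || $1 == "Cached:" { print $1, $2, $3 }'
}

# On Linux, the kernel reports these figures in /proc/meminfo.
if [ -r /proc/meminfo ]; then
    show_buffering < /proc/meminfo
fi

# Ask the kernel to write buffered data out to the storage devices now.
sync
```

The sync command requests that pending writes be flushed, but it is not a substitute for unmounting a device before removing it.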


Determining Device Names 


It’s sometimes difficult to determine the name of a device. In the old days, 
it wasn’t very hard. A device was always in the same place, and it didn’t 
change. Unix-like systems like it that way. When Unix was developed, 
“changing a disk drive” involved using a forklift to remove a washing 
machine-sized device from the computer room. In recent years, the typical
desktop hardware configuration has become quite dynamic, and Linux has 
evolved to become more flexible than its ancestors. 

In the examples in the previous section, we took advantage of the 
modern Linux desktop’s capability to “automagically” mount the device 
and then determine the name after the fact. But what if we are managing 
a server or some other environment where this does not occur? How can 
we figure it out? 

First, let’s look at how the system names devices. If we list the contents 
of the /dev directory (where all devices live), we can see that there are lots 
and lots of devices. 


[me@linuxbox ~]$ ls /dev


The contents of this listing reveal some patterns of device naming. 
Table 15-2 outlines a few of these patterns. 


Table 15-2: Linux Storage Device Names 


Pattern Device 
/dev/fd* Floppy disk drives. 


/dev/hd* IDE (PATA) disks on older systems. Typical motherboards contain two IDE 
connectors or channels, each with a cable with two attachment points 
for drives. The first drive on the cable is called the master device, and the 
second is called the slave device. The device names are ordered such 
that /dev/hda refers to the master device on the first channel, /dev/hdb is 
the slave device on the first channel; /dev/hdc is the master device on the 
second channel, and so on. A trailing digit indicates the partition number 
on the device. For example, /dev/hda1 refers to the first partition on the
first hard drive on the system, while /dev/hda refers to the entire drive. 


/dev/lp* Printers. 


/dev/sd* SCSI disks. On modern Linux systems, the kernel treats all disk-like devices 
(including PATA/SATA hard disks, flash drives, and USB mass storage 
devices such as portable music players and digital cameras) as SCSI 
disks. The rest of the naming system is similar to the older /dev/hd* nam- 
ing scheme previously described. 


/dev/sr* Optical drives (CD/DVD readers and burners). 
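
The patterns in Table 15-2 map directly onto a shell case statement. The sketch below is only an illustration of the naming scheme; the function name describe_device is made up, and the descriptions are shortened from the table.

```shell
#!/bin/sh
# Hypothetical helper: describe a device file name using the patterns
# from Table 15-2.
describe_device() {
    case $1 in
        /dev/fd*) echo "floppy disk drive" ;;
        /dev/hd*) echo "IDE (PATA) disk or partition" ;;
        /dev/lp*) echo "printer" ;;
        /dev/sd*) echo "SCSI-style disk or partition" ;;
        /dev/sr*) echo "optical drive" ;;
        *)        echo "not covered by this table" ;;
    esac
}

describe_device /dev/sdb1    # prints: SCSI-style disk or partition
describe_device /dev/sr0     # prints: optical drive
```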


In addition, we often see symbolic links such as /dev/cdrom, /dev/dvd, and
/dev/floppy, which point to the actual device files, provided as a convenience.

If you are working on a system that does not automatically mount
removable devices, you can use the following technique to determine how
the removable device is named when it is attached. First, start a real-time
view of the /var/log/messages or /var/log/syslog file (you may require
superuser privileges for this).


[me@linuxbox ~]$ sudo tail -f /var/log/messages 


The last few lines of the file will be displayed and then will pause. Next,
plug in the removable device. In this example, we will use a 16MB flash 
drive. Almost immediately, the kernel will notice the device and probe it. 


Jul 23 10:07:53 linuxbox kernel: usb 3-2: new full speed USB device using uhci_hcd and address 2
Jul 23 10:07:53 linuxbox kernel: usb 3-2: configuration #1 chosen from 1 choice
Jul 23 10:07:53 linuxbox kernel: scsi3 : SCSI emulation for USB Mass Storage devices
Jul 23 10:07:58 linuxbox kernel: scsi scan: INQUIRY result too short (5), using 36
Jul 23 10:07:58 linuxbox kernel: scsi 3:0:0:0: Direct-Access Easy Disk .00 PQ: 0 ANSI: 2
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] 31263 512-byte hardware sectors (16 MB)
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Write Protect is off
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] 31263 512-byte hardware sectors (16 MB)
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Write Protect is off
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
Jul 23 10:07:59 linuxbox kernel: sdb: sdb1
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Attached SCSI removable disk
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: Attached scsi generic sg3 type 0


After the display pauses again, press CTRL-C to get the prompt back. 
The interesting parts of the output are the repeated references to [sdb], 
which matches our expectation of a SCSI disk device name. Knowing this, 
these two lines become particularly illuminating: 


Jul 23 10:07:59 linuxbox kernel: sdb: sdb1 
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Attached SCSI removable disk 


This tells us the device name is /dev/sdb for the entire device and 
/dev/sdb1 for the first partition on the device. As we have seen, working
with Linux is full of interesting detective work! 
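
The same detective work can be handed to sed. This is a rough sketch that assumes the kernel messages use the bracketed [sdX] form shown in the log above; the function name device_names is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct bracketed device names (such
# as sdb) found in kernel log text supplied on standard input.
device_names() {
    sed -n 's/.*\[\(sd[a-z]*\)\].*/\1/p' | sort -u
}

device_names <<'EOF'
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Write Protect is off
Jul 23 10:07:59 linuxbox kernel: sdb: sdb1
Jul 23 10:07:59 linuxbox kernel: sd 3:0:0:0: [sdb] Attached SCSI removable disk
EOF
# prints: sdb
```

Prefixing the result with /dev/ gives the device file name for the whole device.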


TIP Using the tail -f /var/log/messages technique is a great way to watch what the 
system is doing in near real-time. 


With our device name in hand, we can now mount the flash drive. 


[me@linuxbox ~]$ sudo mkdir /mnt/flash 
[me@linuxbox ~]$ sudo mount /dev/sdb1 /mnt/flash 
[me@linuxbox ~]$ df
Filesystem  1K-blocks      Used  Available  Use%  Mounted on
/dev/sda2    15115452   5186944    9775164   35%  /
/dev/sda5    59631908  31777376   24776480   57%  /home
/dev/sda1      147764     17277     122858   13%  /boot
tmpfs          776808         0     776808    0%  /dev/shm
/dev/sdb1       15560         0      15560    0%  /mnt/flash


The device name will remain the same as long as it remains physically 
attached to the computer and the computer is not rebooted. 


Creating New File Systems


Suppose that we want to reformat the flash drive with a Linux native file 
system, rather than the FAT32 system it has now. This involves two steps. 


1. (optional) Create a new partition layout if the existing one is not to our 
liking. 


2. Create a new, empty file system on the drive. 


In the following exercise, we are going to format a flash drive. Use a drive
that contains nothing you care about because it will be erased! Again, make
absolutely sure you are specifying the correct device name for your system,
not the one shown in the text. Failure to heed this warning could result in
you formatting (i.e., erasing) the wrong drive!


Manipulating Partitions with fdisk 


fdisk is one of a host of available programs (both command line and
graphical) that allows us to interact directly with disk-like devices (such
as hard disk drives and flash drives) at a very low level. With this tool we
can edit, delete, and create partitions on the device. To work with our flash
drive, we must first unmount it (if needed) and then invoke the fdisk program
as follows:


[me@linuxbox ~]$ sudo umount /dev/sdb1 
[me@linuxbox ~]$ sudo fdisk /dev/sdb 


Notice that we must specify the device in terms of the entire device, not 
by partition number. After the program starts up, we will see the following 
prompt: 

Command (m for help): 


Entering an m will display the program menu. 


Command action
   a   toggle a bootable flag
   b   edit bsd disklabel
   c   toggle the dos compatibility flag
   d   delete a partition
   l   list known partition types
   m   print this menu
   n   add a new partition
   o   create a new empty DOS partition table
   p   print the partition table
   q   quit without saving changes
   s   create a new empty Sun disklabel
   t   change a partition's system id
   u   change display/entry units
   v   verify the partition table
   w   write table to disk and exit
   x   extra functionality (experts only)


Command (m for help): 


The first thing we want to do is examine the existing partition layout. 
We do this by entering p to print the partition table for the device. 


Command (m for help): p 


Disk /dev/sdb: 16 MB, 16006656 bytes 
1 heads, 31 sectors/track, 1008 cylinders 
Units = cylinders of 31 * 512 = 15872 bytes 


Device Boot Start End Blocks Id System 
/dev/sdb1 2 1008 15608+ b W95 FAT32 
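
The figures in this listing can be cross-checked with shell arithmetic. The sketch below simply plugs in the numbers printed by fdisk; the variable names are made up for illustration, and it assumes the Start and End cylinders are both counted as part of the partition.

```shell
#!/bin/sh
# Cross-check the fdisk figures for /dev/sdb1 using values from the listing.
heads=1
sectors_per_track=31
sector_bytes=512
start=2
end=1008

unit_bytes=$(( heads * sectors_per_track * sector_bytes ))  # bytes per cylinder
cylinders=$(( end - start + 1 ))                            # start and end inclusive
partition_bytes=$(( cylinders * unit_bytes ))
blocks=$(( partition_bytes / 1024 ))                        # whole 1K blocks

echo "bytes per cylinder: $unit_bytes"       # prints: bytes per cylinder: 15872
echo "cylinders in partition: $cylinders"    # prints: cylinders in partition: 1007
echo "whole 1K blocks: $blocks"              # prints: whole 1K blocks: 15608
```

The partition works out to 15,983,104 bytes, which is 15,608 whole 1K blocks plus 512 bytes left over, consistent with the 15608+ shown in the Blocks column.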


In this example, we see a 16MB device with a single partition (1) that
uses 1,007 of the available 1,008 cylinders on the device. The partition is
identified as a Windows 95 FAT32 partition. Some programs will use this 
identifier to limit the kinds of operations that can be done to the disk, but 
most of the time it is not critical to change it. However, in the interest of this 
demonstration, we will change it to indicate a Linux partition. To do this, 
we must first find out what ID is used to identify a Linux partition. In the 
previous listing, we see that the ID b is used to specify the existing partition. 
To see a list of the available partition types, we refer to the program menu. 
There we can see the following choice: 


l list known partition types


If we enter l at the prompt, a large list of possible types is displayed.
Among them we see b for our existing partition type and 83 for Linux. 
Going back to the menu, we see this choice to change a partition ID: 


t change a partition's system id 


We enter t at the prompt and enter the new ID. 


Command (m for help): t 

Selected partition 1 

Hex code (type L to list codes): 83 

Changed system type of partition 1 to 83 (Linux) 


This completes all the changes we need to make. Up to this point, the 
device has been untouched (all the changes have been stored in memory, 
not on the physical device), so we will write the modified partition table to 
the device and exit. To do this, we enter w at the prompt. 


Command (m for help): w 
The partition table has been altered! 


Calling ioctl() to re-read partition table. 


WARNING: If you have created or modified any DOS 6.x 
partitions, please see the fdisk manual page for additional 
information. 

Syncing disks. 

[me@linuxbox ~]$ 


If we had decided to leave the device unaltered, we could have entered 
q at the prompt, which would have exited the program without writing the 
changes. We can safely ignore the ominous-sounding warning message. 


Creating a New File System with mkfs 


With our partition editing done (lightweight though it might have been), 
it’s time to create a new file system on our flash drive. To do this, we will use 
mkfs (short for “make file system”), which can create file systems in a variety 
of formats. To create an ext4 file system on the device, we use the -t option 
to specify the “ext4” system type, followed by the name of the device con- 
taining the partition we want to format. 


[me@linuxbox ~]$ sudo mkfs -t ext4 /dev/sdb1 
mke2fs 2.23.2 (12-Jul-2011) 
Filesystem label= 
OS type: Linux 
Block size=1024 (log=0) 
Fragment size=1024 (log=0) 
3904 inodes, 15608 blocks 
780 blocks (5.00%) reserved for the super user 
First data block=1 
Maximum filesystem blocks=15990784 
2 block groups 
8192 blocks per group, 8192 fragments per group 
1952 inodes per group 
Superblock backups stored on blocks: 
8193 


Writing inode tables: done 
Creating journal (1024 blocks): done 
Writing superblocks and filesystem accounting information: done 


This filesystem will be automatically checked every 34 mounts or 
180 days, whichever comes first. Use tune2fs -c or -i to override. 
[me@linuxbox ~]$ 


The program will display a lot of information when ext4 is the chosen 
file system type. To reformat the device to its original FAT32 file system, 
specify vfat as the file system type. 


[me@linuxbox ~]$ sudo mkfs -t vfat /dev/sdb1


This process of partitioning and formatting can be used anytime addi- 
tional storage devices are added to the system. While we worked with a tiny 
flash drive, the same process can be applied to internal hard disks and other 
removable storage devices like USB hard drives. 


Testing and Repairing File Systems 


In our earlier discussion of the /etc/fstab file, we saw some mysterious digits
at the end of each line. Each time the system boots, it routinely checks the 
integrity of the file systems before mounting them. This is done by the fsck 
program (short for “file system check”). The last number in each fstab entry 
specifies the order in which the devices are to be checked. In our previous 
example, we see that the root file system is checked first, followed by the home 
and boot file systems. Devices with a zero as the last digit are not routinely 
checked. 
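
The checking order can be read straight out of the sixth field. This is a minimal sketch with a made-up function name, fsck_order; it treats any entry whose last field is greater than zero as one that is routinely checked, as described above.

```shell
#!/bin/sh
# Hypothetical helper: list "order mount_point" pairs from fstab-format
# text on standard input, lowest order first, skipping entries whose
# sixth field is 0.
fsck_order() {
    awk '$1 !~ /^#/ && NF >= 6 && $6 > 0 { print $6, $2 }' | sort -n
}

fsck_order <<'EOF'
LABEL=/12        /          ext4    defaults        1 1
LABEL=/home      /home      ext4    defaults        1 2
LABEL=/boot      /boot      ext4    defaults        1 2
tmpfs            /dev/shm   tmpfs   defaults        0 0
EOF
```

For the sample entries, the root file system comes first with order 1, the two entries with order 2 follow, and the tmpfs entry is left out.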

In addition to checking the integrity of file systems, fsck can also repair 
corrupt file systems with varying degrees of success, depending on the 
amount of damage. On Unix-like file systems, recovered portions of files 
are placed in the lost+found directory, located in the root of each file system. 

To check our flash drive (which should be unmounted first), we could 
do the following: 


[me@linuxbox ~]$ sudo fsck /dev/sdb1
fsck 1.40.8 (13-Mar-2016)
e2fsck 1.40.8 (13-Mar-2016)
/dev/sdb1: clean, 11/3904 files, 1661/15608 blocks


These days, file system corruption is quite rare unless there is a hard- 
ware problem, such as a failing disk drive. On most systems, file system cor- 
ruption detected at boot time will cause the system to stop and direct you to 
run fsck before continuing. 


WHAT THE FSCK? 


In Unix culture, the word fsck is often used in place of a popular word with
which it shares three letters. This is especially appropriate, given that you will
probably be uttering the aforementioned word if you find yourself in a situation
where you are forced to run fsck.


Moving Data Directly to and from Devices 


While we usually think of data on our computers as being organized into 
files, it is also possible to think of the data in “raw” form. If we look at a disk 
drive, for example, we see that it consists of a large number of “blocks” of 
data that the operating system sees as directories and files. However, if we 
could treat a disk drive as simply a large collection of data blocks, we could 
perform useful tasks, such as cloning devices. 

The dd program performs this task. It copies blocks of data from one 
place to another. It uses a unique syntax (for historical reasons) and is usu- 
ally used this way: 


dd if=input_file of=output_file [bs=block_size [count=blocks]] 


WARNING The dd command is very powerful. Though its name derives from “data
definition,” it is sometimes called “destroy disk” because users often mistype
either the if or of specification. Always double-check your input and output
specifications before pressing ENTER!


Let’s say we had two USB flash drives of the same size and we wanted
to exactly copy the first drive to the second. If we attached both drives to
the computer and they were assigned to devices /dev/sdb and /dev/sdc,
respectively, we could copy everything on the first drive to the second drive
with the following:


dd if=/dev/sdb of=/dev/sdc 


Alternatively, if only the first device were attached to the computer, we
could copy its contents to an ordinary file for later restoration or copying.


dd if=/dev/sdb of=flash_drive.img
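

Because a mistyped of= can overwrite a real disk, it may be worth rehearsing
the syntax on ordinary files first. The following is a minimal sketch that
never touches a device: the filenames fake_drive.img and clone.img are
made up for the exercise, and the work happens in a scratch directory.

```shell
# Work in a scratch directory so no real device is involved.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 1 MiB stand-in for a drive: 1024 blocks of 1024 bytes each.
dd if=/dev/zero of=fake_drive.img bs=1024 count=1024 2>/dev/null

# Clone it block by block, using the same syntax as a device-to-device copy.
dd if=fake_drive.img of=clone.img bs=1024 2>/dev/null

# cmp -s prints nothing and returns 0 only when the files are identical.
if cmp -s fake_drive.img clone.img; then
    clone_status="identical"
else
    clone_status="different"
fi
echo "clone is $clone_status"
```

The bs and count operands together set how much data is copied; leaving out
count, as in the second command, copies until the input is exhausted.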


Creating CD-ROM Images 


Writing a recordable CD-ROM (either a CD-R or CD-RW) consists of 
two steps. 


1. Constructing an ISO image file that is the exact file system image of the
CD-ROM


2. Writing the image file onto the CD-ROM media 


Creating an Image Copy of a CD-ROM 


If we want to make an ISO image of an existing CD-ROM, we can use dd to 
read all the data blocks off the CD-ROM and copy them to a local file. Say 
we had an Ubuntu CD and we wanted to make an ISO file that we could
later use to make more copies. After inserting the CD and determining its
device name (we’ll assume /dev/cdrom), we can make the ISO file like so: 


dd if=/dev/cdrom of=ubuntu.iso 


This technique works for data DVDs as well but will not work for audio 
CDs, as they do not use a file system for storage. For audio CDs, look at the 
cdrdao command. 


Creating an Image from a Collection of Files 


To create an ISO image file containing the contents of a directory, we use 
the genisoimage program. To do this, we first create a directory containing all 
the files we want to include in the image and then execute the genisoimage 
command to create the image file. For example, if we had created a direc- 
tory called ~/cd-rom-files and filled it with files for our CD-ROM, we could 
create an image file named cd-rom.iso with the following command: 


genisoimage -o cd-rom.iso -R -J ~/cd-rom-files 


The -R option adds metadata for the Rock Ridge extensions, which allows 
the use of long filenames and POSIX-style file permissions. Likewise, the 
-J option enables the Joliet extensions, which permit long filenames for 
Windows. 


A PROGRAM BY ANY OTHER NAME... 


If you look at online tutorials for creating and burning optical media like
CD-ROMs and DVDs, you will frequently encounter two programs called mkisofs
and cdrecord. These programs were part of a popular package called cdrtools
authored by Jörg Schilling. In the summer of 2006, Mr. Schilling made a license
change to a portion of the cdrtools package, which, in the opinion of many in
the Linux community, created a license incompatibility with the GNU GPL. As a
result, a fork of the cdrtools project was started that now includes replacement
programs for cdrecord and mkisofs named wodim and genisoimage, respectively.


Writing CD-ROM Images 


After we have an image file, we can burn it onto our optical media. Most of 
the commands we will discuss in the sections that follow can be applied to 
both recordable CD-ROM and DVD media.


Mounting an ISO Image Directly


There is a trick that we can use to mount an ISO image while it is still on 
our hard disk and treat it as though it were already on optical media. By 
adding the -o loop option to mount (along with the required -t iso9660 file 
system type), we can mount the image file as though it were a device and 
attach it to the file system tree. 


mkdir /mnt/iso_image 
mount -t iso9660 -o loop image.iso /mnt/iso_image 


In this example, we created a mount point named /mnt/iso_image and 
then mounted the image file image.iso at that mount point. After the image 
is mounted, it can be treated just as though it were a real CD-ROM or DVD. 
Remember to unmount the image when it is no longer needed. 


Blanking a Rewritable CD-ROM 


Rewritable CD-RW media needs to be erased or blanked before it can be 
reused. To do this, we can use wodim, specifying the device name for the CD 
writer and the type of blanking to be performed. The wodim program offers 
several types. The most minimal (and fastest) is the “fast” type. 


wodim dev=/dev/cdrw blank=fast


Writing an Image


To write an image, we again use wodim, specifying the name of the optical 
media writer device and the name of the image file. 


wodim dev=/dev/cdrw image.iso 


In addition to the device name and image file, wodim supports a large 
set of options. Two common ones are -v for verbose output, and -dao, which 
writes the disc in disc-at-once mode. This mode should be used if you are 
preparing a disc for commercial reproduction. The default mode for wodim 
is track-at-once, which is useful for recording music tracks. 


Summing Up 


In this chapter, we looked at the basic storage management tasks. There 
are, of course, many more. Linux supports a vast array of storage devices 
and file system schemes. It also offers many features for interoperability
with other systems.


Extra Credit


It’s often useful to verify the integrity of an ISO image that we have down- 
loaded. In most cases, a distributor of an ISO image will also supply a 
checksum file. A checksum is the result of an exotic mathematical calcula- 
tion resulting in a number that represents the content of the target file. 
If the contents of the file change by even one bit, the resulting checksum 
will be much different. The most common method of checksum generation
uses the md5sum program. When you use md5sum, it produces a unique
hexadecimal number.


md5sum image.iso 
34e354760f9bb7fbf85c96f6a3f94ece image.iso 


After you download an image, you should run md5sum against it and 
compare the results with the md5sum value supplied by the publisher. 
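
Comparing two long hexadecimal strings by eye is error-prone, so it can help
to let md5sum do the comparison with its -c option, which reads a checksum
file and rechecks each file listed in it. The sketch below is a minimal
illustration on a scratch file; the names image.iso and image.iso.md5 and the
file contents are made up for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for a downloaded image (hypothetical content).
printf 'pretend this is an ISO image\n' > image.iso

# What a publisher would do: record the checksum in a file.
md5sum image.iso > image.iso.md5

# What we do after downloading: -c recomputes the checksum of every
# file named in image.iso.md5 and returns nonzero on any mismatch.
if md5sum -c image.iso.md5 > /dev/null 2>&1; then
    intact_result="OK"
else
    intact_result="FAILED"
fi

# Change the file by a single byte and check again.
printf 'x' >> image.iso
if md5sum -c image.iso.md5 > /dev/null 2>&1; then
    altered_result="OK"
else
    altered_result="FAILED"
fi

echo "before change: $intact_result, after change: $altered_result"
```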

In addition to checking the integrity of a downloaded file, we can use 
md5sum to verify newly written optical media. To do this, we first calculate the 
checksum of the image file and then calculate a checksum for the media. 
The trick to verifying the media is to limit the calculation to only the portion 
of the optical media that contains the image. We do this by determining the 
number of 2,048-byte blocks the image contains (optical media is always 
written in 2,048-byte blocks) and reading that many blocks from the media. 
On some types of media, this is not required. CD-R and CD-RW disks written 
in disc-at-once mode can be checked this way. 


md5sum /dev/cdrom
34e354760f9bb7fbf85c96f6a3f94ece /dev/cdrom


Many types of media, such as DVDs, require a precise calculation of the
number of blocks. In the following example, we check the integrity of the
image file dvd-image.iso and the disc in the DVD reader /dev/dvd. Can you
figure out how this works?


md5sum dvd-image.iso; dd if=/dev/dvd bs=2048 \
    count=$(( $(stat -c "%s" dvd-image.iso) / 2048 )) | md5sum
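

The same block-counting idea can be tried without a DVD drive. In the sketch
below, which is an illustration under stated assumptions rather than a test of
real media, an ordinary file named media.bin plays the part of the disc: it
holds the image followed by a few padding blocks, standing in for the extra
data a drive may return after the end of the image. All filenames are
hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A 10-block image (2,048-byte blocks) filled with random data.
dd if=/dev/urandom of=image.iso bs=2048 count=10 2>/dev/null

# Simulated media: the image followed by 5 blocks of padding.
cat image.iso > media.bin
dd if=/dev/zero bs=2048 count=5 2>/dev/null >> media.bin

# Size of the image in bytes, divided by the block size.
blocks=$(( $(stat -c "%s" image.iso) / 2048 ))

# Checksum of the image, of the first $blocks blocks of the media,
# and of the whole media including the padding.
image_sum=$(md5sum < image.iso | cut -d ' ' -f 1)
media_sum=$(dd if=media.bin bs=2048 count="$blocks" 2>/dev/null |
    md5sum | cut -d ' ' -f 1)
whole_sum=$(md5sum < media.bin | cut -d ' ' -f 1)

echo "image:        $image_sum"
echo "media (part): $media_sum"
echo "media (all):  $whole_sum"
```

Only the checksum limited to the image's block count matches the image;
reading the padding as well produces a different value, which is why the
count operand matters.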


NETWORKING


When it comes to networking, there is probably nothing that cannot be done
with Linux. Linux is used to build all sorts of networking systems and
appliances, including firewalls, routers, name servers, network-attached
storage (NAS) boxes, and on and on.


Just as the subject of networking is vast, so is the number of commands
that can be used to configure and control it. We will focus our
attention on just a few of the most frequently used ones. The commands
chosen for examination include those used to monitor networks and
those used to transfer files. In addition, we are going to explore the ssh
program that is used to perform remote logins. This chapter will cover
the following commands:


ping Send an ICMP ECHO_REQUEST to network hosts 


traceroute Print the route packets trace to a network host 


ip Show/manipulate routing, devices, policy routing, and tunnels 


netstat Print network connections, routing tables, interface statistics, 
masquerade connections, and multicast memberships 


ftp Internet file transfer program 

wget Non-interactive network downloader 

ssh OpenSSH SSH client (remote login program) 

We're going to assume a little background in networking. In this, the
Internet age, everyone using a computer needs a basic understanding of
networking concepts. To make full use of this chapter, we should be
familiar with the following terms:


• Internet Protocol (IP) address
• Host and domain name
• Uniform Resource Identifier (URI)


Some of the commands we will cover may (depending on your distribution) require 
the installation of additional packages from your distribution’s repositories, and some 
may require superuser privileges to execute. 


Examining and Monitoring a Network 


Even if you’re not the system administrator, it’s often helpful to examine the 
performance and operation of a network. 


ping 

The most basic network command is ping. The ping command sends a 
special network packet called an ICMP ECHO_REQUEST to a specified 
host. Most network devices receiving this packet will reply to it, allowing 
the network connection to be verified. 


It is possible to configure most network devices (including Linux hosts) to ignore these 
packets. This is usually done for security reasons to partially obscure a host from a 
potential attacker. It is also common for firewalls to be configured to block ICMP traffic. 


For example, to see whether we can reach linuxcommand.org (one of our 
favorite sites), we can use ping like this: 


[me@linuxbox ~]$ ping linuxcommand.org 


Once started, ping continues to send packets at a specified interval (the 
default is one second) until it is interrupted. 


[me@linuxbox ~]$ ping linuxcommand.org 

PING linuxcommand.org (66.35.250.210) 56(84) bytes of data. 

64 bytes from vhost.sourceforge.net (66.35.250.210): icmp_seq=1 ttl=43 time=107 ms
64 bytes from vhost.sourceforge.net (66.35.250.210): icmp_seq=2 ttl=43 time=108 ms
64 bytes from vhost.sourceforge.net (66.35.250.210): icmp_seq=3 ttl=43 time=106 ms
64 bytes from vhost.sourceforge.net (66.35.250.210): icmp_seq=4 ttl=43 time=106 ms
64 bytes from vhost.sourceforge.net (66.35.250.210): icmp_seq=5 ttl=43 time=105 ms
64 bytes from vhost.sourceforge.net (66.35.250.210): icmp_seq=6 ttl=43 time=107 ms


--- linuxcommand.org ping statistics --- 
6 packets transmitted, 6 received, 0% packet loss, time 6010ms 
rtt min/avg/max/mdev = 105.647/107.052/108.118/0.824 ms 


After it is interrupted (in this case after the sixth packet) by pressing 
CTRL-C, ping prints performance statistics. A properly performing network 
will exhibit 0 percent packet loss. A successful “ping” will indicate that the 
elements of the network (its interface cards, cabling, routing, and gateways) 
are in generally good working order. 
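

When ping is run from a script rather than watched by eye, the packet-loss
figure can be pulled out of the summary line. The following is a minimal
sketch that works on the summary text from the session above, stored in a
variable; the variable names are our own, and on a live system the line could
be captured with something like ping -c 6 host (the -c option stops ping
after the given number of packets).

```shell
# Summary line copied from the sample session above.
summary='6 packets transmitted, 6 received, 0% packet loss, time 6010ms'

# Extract the number that appears directly before "% packet loss".
loss=$(printf '%s\n' "$summary" |
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')

if [ "$loss" = "0" ]; then
    verdict="no packets lost"
else
    verdict="${loss}% of packets lost"
fi
echo "$verdict"
```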


traceroute 


The traceroute program (some systems use the similar tracepath program 
instead) lists all the “hops” network traffic takes to get from the local system 
to a specified host. For example, to see the route taken to reach slashdot.org, 
we would do this: 


[me@linuxbox ~]$ traceroute slashdot.org 


The output looks like this: 


traceroute to slashdot.org (216.34.181.45), 30 hops max, 40 byte packets
 1 ipcop.localdomain (192.168.1.1) 1.066 ms 1.366 ms 1.720 ms
 2 * * *
 3 ge-4-13-ur01.rockville.md.bad.comcast.net (68.87.130.9) 14.622 ms 14.885 ms 15.169 ms
 4 po-30-ur02.rockville.md.bad.comcast.net (68.87.129.154) 17.634 ms 17.626 ms 17.899 ms
 5 po-60-ur03.rockville.md.bad.comcast.net (68.87.129.158) 15.992 ms 15.983 ms 16.256 ms
 6 po-30-ar01.howardcounty.md.bad.comcast.net (68.87.136.5) 22.835 ms 14.233 ms 14.405 ms
 7 po-10-ar02.whitemarsh.md.bad.comcast.net (68.87.129.34) 16.154 ms 13.600 ms 18.867 ms
 8 te-0-3-0-1-cr01.philadelphia.pa.ibone.comcast.net (68.86.90.77) 21.951 ms 21.073 ms 21.557 ms
 9 pos-0-8-0-0-cr01.newyork.ny.ibone.comcast.net (68.86.85.10) 22.917 ms 21.884 ms 22.126 ms
10 204.70.144.1 (204.70.144.1) 43.110 ms 21.248 ms 21.264 ms
11 cr1-pos-0-7-3-1.newyork.savvis.net (204.70.195.93) 21.857 ms cr2-pos-0-0-3-1.newyork.savvis.net (204.70.204.238) 19.556 ms cr1-pos-0-7-3-1.newyork.savvis.net (204.70.195.93) 19.634 ms
12 cr2-pos-0-7-3-0.chicago.savvis.net (204.70.192.109) 41.586 ms 42.843 ms cr2-tengig-0-0-2-0.chicago.savvis.net (204.70.196.242) 43.115 ms
13 hr2-tengigabitethernet-12-1.elkgrovech3.savvis.net (204.70.195.122) 44.215 ms 41.833 ms 45.658 ms
14 csr1-ve241.elkgrovech3.savvis.net (216.64.194.42) 46.840 ms 43.372 ms 47.041 ms
15 64.27.160.194 (64.27.160.194) 56.137 ms 55.887 ms 52.810 ms
16 slashdot.org (216.34.181.45) 42.727 ms 42.016 ms 41.437 ms


In the output, we can see that connecting from our test system to
slashdot.org requires traversing 16 routers. For routers that provided
identifying information, we see their hostnames, IP addresses, and
performance data, which includes three samples of round-trip time from the
local system to the router. For routers that do not provide identifying
information (because of router configuration, network congestion, firewalls,
etc.), we see asterisks as in the line for hop number 2. In cases where
routing information is blocked, we can sometimes overcome this by adding
either the -T or -I option to the traceroute command.


ip 

The ip program is a multipurpose network configuration tool that makes use 
of the full range of networking features available in modern Linux kernels. 
It replaces the earlier and now deprecated ifconfig program. With ip, we can 
examine a system’s network interfaces and routing table. 


[me@linuxbox ~]$ ip a 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ac:22:0b:52:cf:84 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.14/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ae22:bff:fe52:cf84/64 scope link
       valid_lft forever preferred_lft forever


In the preceding example, we see that our test system has two network
interfaces. The first, called lo, is the loopback interface, a virtual interface that
the system uses to “talk to itself,” and the second, called eth0, is the Ethernet
interface.

When performing casual network diagnostics, the important things to 
look for are the presence of the word UP in the first line for each interface, 
indicating that the network interface is enabled, and the presence of a valid 
IP address in the inet field on the third line. For systems using Dynamic 
Host Configuration Protocol (DHCP), a valid IP address in this field will 
verify that the DHCP is working. 
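

The two things to look for, the UP flag and an inet address, can also be
picked out mechanically. The following is a minimal sketch that runs awk over
a trimmed copy of the listing above held in a variable; the variable names are
our own, and on a live system the same awk program could read the output of
ip a through a pipe instead.

```shell
# Trimmed copy of the sample listing; only the lines the check needs.
sample='1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP
    inet 192.168.1.14/24 brd 192.168.1.255 scope global eth0'

report=$(printf '%s\n' "$sample" | awk '
    /^[0-9]+: / {
        # Header line such as "2: eth0: <FLAGS> ...": remember the name
        # and whether UP appears as a separate flag inside the brackets.
        name = $2
        sub(/:$/, "", name)
        up = ($3 ~ /[<,]UP[,>]/) ? "UP" : "DOWN"
    }
    # IPv4 address line: print interface, state, and address.
    $1 == "inet" { print name, up, $2 }
')

printf '%s\n' "$report"
```

The flag test requires a bracket or comma on each side of UP so that the
LOWER_UP flag alone is not mistaken for it.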


netstat 


The netstat program is used to examine various network settings and 
statistics. Through the use of its many options, we can look at a variety of 
features in our network setup. Using the -ie option, we can examine the 
network interfaces in our system. 


[me@linuxbox ~]$ netstat -ie 
eth0      Link encap:Ethernet  HWaddr 00:1d:09:9b:99:67
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::21d:9ff:fe9b:9967/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:238488 errors:0 dropped:0 overruns:0 frame:0
          TX packets:403217 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:153098921 (146.0 MB)  TX bytes:261035246 (248.9 MB)
          Memory:fdfc0000-fdfe0000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2208 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2208 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:111490 (108.8 KB)  TX bytes:111490 (108.8 KB)


Using the -r option will display the kernel’s network routing table. 
This shows how the network is configured to send packets from network to 
network: 


[me@linuxbox ~]$ netstat -r 
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
default         192.168.1.1     0.0.0.0         UG        0 0          0 eth0


In this simple example, we see a typical routing table for a client 
machine on a local area network (LAN) behind a firewall/router. The 
first line of the listing shows the destination 192.168.1.0. IP addresses that 
end in zero refer to networks rather than individual hosts, so this destina- 
tion means any host on the LAN. The next field, Gateway, is the name or 
IP address of the gateway (router) used to go from the current host to the 
destination network. An asterisk in this field indicates that no gateway is 
needed. 

The last line contains the destination default. This means any traffic des- 
tined for a network that is not otherwise listed in the table. In our example, 
we see that the gateway is defined as a router with the address of 192.168.1.1, 
which presumably knows what to do with the destination traffic. 
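
A script that needs the address of the default gateway can read it from the
same table. The following is a minimal sketch that works on a copy of the
sample table held in a variable named routes (our own name); on a live system
the awk command could be fed from netstat -r instead.

```shell
# Copy of the sample routing table shown above.
routes='Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
default         192.168.1.1     0.0.0.0         UG        0 0          0 eth0'

# The default route is the row whose Destination column reads "default";
# its second column is the gateway address.
gateway=$(printf '%s\n' "$routes" | awk '$1 == "default" { print $2 }')

echo "default gateway: $gateway"
```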

Like ip, the netstat program has many options, and we have looked 
only at a couple. Check out the ip and netstat man pages for a complete list. 


Transporting Files over a Network 


What good is a network unless we can move files across it? There are many 
programs that move data over networks. We will cover two of them now and 
several more in later sections.


ftp


One of the true “classic” programs, ftp gets its name from the protocol it 
uses, the File Transfer Protocol. FTP was once the most widely used method of 
downloading files over the Internet. Most, if not all, web browsers support 
it, and you often see URIs starting with the protocol ftp://. 

Before there were web browsers, there was the ftp program. ftp is used 
to communicate with FTP servers, machines that contain files that can be 
uploaded and downloaded over a network. 

FTP (in its original form) is not secure because it sends account names 
and passwords in cleartext. This means they are not encrypted, and anyone 
sniffing the network can see them. Because of this, almost all FTP done over 
the Internet is done by anonymous FTP servers. An anonymous server allows 
anyone to log in using the login name “anonymous” and a meaningless 
password. 

In the example that follows, we show a typical session with the ftp
program downloading an Ubuntu ISO image located in the
/pub/cd_images/ubuntu-18.04 directory of the anonymous FTP server fileserver:


[me@linuxbox ~]$ ftp fileserver
Connected to fileserver.localdomain.
220 (vsFTPd 2.0.1)
Name (fileserver:me): anonymous
331 Please specify the password.
Password:
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd pub/cd_images/ubuntu-18.04
250 Directory successfully changed.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
-rw-rw-r-- 1 500 500 733079552 Apr 25 03:53 ubuntu-18.04-desktop-amd64.iso
226 Directory send OK.
ftp> lcd Desktop
Local directory now /home/me/Desktop
ftp> get ubuntu-18.04-desktop-amd64.iso
local: ubuntu-18.04-desktop-amd64.iso remote: ubuntu-18.04-desktop-amd64.iso
200 PORT command successful. Consider using PASV.
150 Opening BINARY mode data connection for ubuntu-18.04-desktop-amd64.iso (733079552 bytes).
226 File send OK.
733079552 bytes received in 68.56 secs (10441.5 kB/s)
ftp> bye


Table 16-1 provides an explanation of the commands entered during
this session.


Table 16-1: Examples of Interactive ftp Commands


Command                              Meaning

ftp fileserver                       Invoke the ftp program and have it connect to
                                     the FTP server fileserver.

anonymous                            Login name. After the login prompt, a password
                                     prompt will appear. Some servers will accept a
                                     blank password; others will require a password
                                     in the form of an email address. In that case, try
                                     something like user@example.com.

cd pub/cd_images/ubuntu-18.04        Change to the directory on the remote system
                                     containing the desired file. Note that on most
                                     anonymous FTP servers, the files for public
                                     downloading are found somewhere under the pub
                                     directory.

ls                                   List the directory on the remote system.

lcd Desktop                          Change the directory on the local system to
                                     ~/Desktop. In the example, the ftp program was
                                     invoked when the working directory was ~. This
                                     command changes the working directory to
                                     ~/Desktop.

get ubuntu-18.04-desktop-amd64.iso   Tell the remote system to transfer the file
                                     ubuntu-18.04-desktop-amd64.iso to the local
                                     system. Since the working directory on the local
                                     system was changed to ~/Desktop, the file will be
                                     downloaded there.

bye                                  Log off the remote server and end the ftp
                                     program session. The commands quit and exit may
                                     also be used.


Typing help at the ftp> prompt will display a list of the supported 
commands. Using ftp on a server where sufficient permissions have been 
granted, it is possible to perform many ordinary file management tasks. It’s 
clumsy, but it does work. 


lftp: a Better ftp


ftp is not the only command-line FTP client. In fact, there are many. One of
the better (and more popular) ones is lftp by Alexander Lukyanov. It works
much like the traditional ftp program but has many additional convenience
features including multiple-protocol support (including HTTP), automatic
retry on failed downloads, background processes, tab completion of path
names, and many more.


wget

Another popular command-line program for file downloading is wget. It is 
useful for downloading content from both web and FTP sites. Single files, 
multiple files, and even entire sites can be downloaded. To download the 
first page of linuxcommand.org, we could do this: 


[me@linuxbox ~]$ wget http://linuxcommand.org/index.php
--11:02:51--  http://linuxcommand.org/index.php
           => `index.php'
Resolving linuxcommand.org... 66.35.250.210
Connecting to linuxcommand.org|66.35.250.210|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=>                                 ] 3,120         --.--K/s

11:02:51 (161.75 MB/s) - `index.php' saved [3120]


The program’s many options allow wget to recursively download, down- 
load files in the background (allowing you to log off but continue down- 
loading), and complete the download of a partially downloaded file. These 
features are well documented in its better-than-average man page.


For many years, Unix-like operating systems have had the capability to be
administered remotely via a network. In the early days, before the general 
adoption of the Internet, there were a couple of popular programs used to 
log in to remote hosts. These were the rlogin and telnet programs. These 
programs, however, suffer from the same fatal flaw that the ftp program 
does; they transmit all their communications (including login names and 
passwords) in cleartext. This makes them wholly inappropriate for use in 
the Internet Age. 


ssh 


To address this problem, a new protocol called Secure Shell (SSH) was 
developed. SSH solves the two basic problems of secure communication 
with a remote host. 


• It authenticates that the remote host is who it says it is (thus preventing
  so-called man-in-the-middle attacks).

• It encrypts all of the communications between the local and remote
  hosts.


SSH consists of two parts. An SSH server runs on the remote host, listen- 
ing for incoming connections, by default, on port 22, while an SSH client is 
used on the local system to communicate with the remote server. 


Most Linux distributions ship an implementation of SSH called OpenSSH
from the OpenBSD project. Some distributions include both the client and
the server packages by default (for example, Red Hat), while others (such as
Ubuntu) supply only the client. To enable a system to receive remote
connections, it must have the OpenSSH-server package installed, configured,
and running, and (if the system either is running or is behind a firewall) it
must allow incoming network connections on TCP port 22.


TIP If you don’t have a remote system to connect to but want to try these examples, make 
sure the OpenSSH-server package is installed on your system and use localhost as 
the name of the remote host. That way, your machine will create network connections 
with itself. 


The SSH client program used to connect to remote SSH servers is 
called, appropriately enough, ssh. To connect to a remote host named 
remote-sys, we would use the ssh client program like so: 


[me@linuxbox ~]$ ssh remote-sys 

The authenticity of host 'remote-sys (192.168.1.4)' can't be established. 
RSA key fingerprint is 41:ed:7a:df:23:19:bf:3c:a5:17:bc:61:b3:7f:d9:bb. 
Are you sure you want to continue connecting (yes/no)? 


The first time the connection is attempted, a message is displayed indi- 
cating that the authenticity of the remote host cannot be established. This 
is because the client program has never seen this remote host before. To 
accept the credentials of the remote host, enter yes when prompted. Once 
the connection is established, the user is prompted for a password. 


Warning: Permanently added 'remote-sys,192.168.1.4' (RSA) to the list of known hosts. 
me@remote-sys's password: 


After the password is successfully entered, we receive the shell prompt 
from the remote system. 


Last login: Sat Aug 25 13:00:48 2018 
[me@remote-sys ~]$ 


The remote shell session continues until the user enters the exit com- 
mand at the remote shell prompt, thereby closing the remote connection. 
At this point, the local shell session resumes, and the local shell prompt 
reappears. 

It is also possible to connect to remote systems using a different username. 
For example, if the local user me had an account named bob on a remote sys- 
tem, user me could log in to the account bob on the remote system as follows: 


[me@linuxbox ~]$ ssh bob@remote-sys 
bob@remote-sys's password: 

Last login: Sat Aug 25 13:03:21 2018 
[bob@remote-sys ~]$


As stated earlier, ssh verifies the authenticity of the remote host. If
the remote host does not successfully authenticate, the following message 
appears: 


[me@linuxbox ~]$ ssh remote-sys
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
41:ed:7a:df:23:19:bf:3c:a5:17:bc:61:b3:7f:d9:bb.
Please contact your system administrator.
Add correct host key in /home/me/.ssh/known_hosts to get rid of this message.
Offending key in /home/me/.ssh/known_hosts:1
RSA host key for remote-sys has changed and you have requested strict
checking.
Host key verification failed.


This message is caused by one of two possible situations. First, an attacker 
may be attempting a man-in-the-middle attack. This is rare because every- 
body knows that ssh alerts the user to this. The more likely culprit is that the 
remote system has been changed somehow; for example, its operating system 
or SSH server has been reinstalled. In the interests of security and safety, 
however, the first possibility should not be dismissed out of hand. Always 
check with the administrator of the remote system when this message occurs. 

After it has been determined that the message has a benign cause, it is
safe to correct the problem on the client side. This is done by using a text
editor (vim perhaps) to remove the obsolete key from the ~/.ssh/known_hosts
file. In the preceding example message, we see this:


Offending key in /home/me/.ssh/known_hosts:1 


This means that the first line of the known_hosts file contains the 
offending key. Delete this line from the file, and the ssh program will be 
able to accept new authentication credentials from the remote system. 
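
The deletion can also be done without opening an editor. The following is a
minimal sketch that practices on a scratch copy rather than the real
~/.ssh/known_hosts; the two entries and their keys are made-up placeholders.
OpenSSH also provides ssh-keygen -R hostname, which removes every key
recorded for the named host.

```shell
# Practice on a scratch file, not the real ~/.ssh/known_hosts.
workdir=$(mktemp -d)
known_hosts="$workdir/known_hosts"

# Two placeholder entries; the key text is invented for the demonstration.
printf '%s\n' \
    'remote-sys,192.168.1.4 ssh-rsa AAAA-old-key-placeholder' \
    'other-host ssh-rsa AAAA-other-key-placeholder' > "$known_hosts"

# "Offending key in .../known_hosts:1" names line 1 of the file.
offending_line=1

# Write every line except the offending one to a new file, then
# replace the original only if sed succeeded.
sed "${offending_line}d" "$known_hosts" > "$known_hosts.new" &&
    mv "$known_hosts.new" "$known_hosts"

remaining=$(cat "$known_hosts")
printf '%s\n' "$remaining"
```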

Besides opening a shell session on a remote system, ssh allows us to 
execute a single command on a remote system. For example, to execute the 
free command on a remote host named remote-sys and have the results dis- 
played on the local system, use this: 


[me@linuxbox ~]$ ssh remote-sys free
me@remote-sys's password:
             total       used       free     shared    buffers     cached
Mem:        775536     507184     268352          0     110068     154596
-/+ buffers/cache:     242520     533016
Swap:      1572856          0    1572856
[me@linuxbox ~]$


It’s possible to use this technique in more interesting ways, such as the
following example in which we perform an ls on the remote system and
redirect the output to a file on the local system:


[me@linuxbox ~]$ ssh remote-sys 'ls *' > dirlist.txt
me@remote-sys's password:
[me@linuxbox ~]$


Notice the use of the single quotes in the preceding command. This is 
done because we do not want the pathname expansion performed on the 
local machine; rather, we want it to be performed on the remote system. 
Likewise, if we had wanted the output redirected to a file on the remote 
machine, we could have placed the redirection operator and the filename 
within the single quotes. 


[me@linuxbox ~]$ ssh remote-sys 'ls * > dirlist.txt'
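

The effect of the quotes can be observed without a remote machine. The sketch
below is only an analogy: a second shell started with sh -c stands in for the
remote system, and two scratch directories named local and remote (our own
names) each hold one differently named file, so we can tell which shell
expanded the wildcard.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/local" "$workdir/remote"
touch "$workdir/local/a.txt" "$workdir/remote/b.txt"
cd "$workdir/local" || exit 1

# Unquoted: our own shell expands * here, in the local directory, and
# passes the resulting names along as arguments to the second shell.
unquoted=$(sh -c 'cd "$1" && shift && echo "$@"' sh "$workdir/remote" *)

# Quoted: the * travels untouched inside the command string and is
# expanded by the second shell after it changes directory.
quoted=$(sh -c 'cd "$1" && echo *' sh "$workdir/remote")

echo "expanded locally:  $unquoted"
echo "expanded remotely: $quoted"
```

The first result names the file from the local directory and the second
names the file from the stand-in remote directory, which mirrors what
happens when the single quotes are omitted or included in an ssh command.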


TUNNELING WITH SSH 


Part of what happens when you establish a connection with a remote host via 
SSH is that an encrypted tunnel is created between the local and remote systems. 
Normally, this tunnel is used to allow commands typed at the local system to be 
transmitted safely to the remote system and for the results to be transmitted safely 
back. In addition to this basic function, the SSH protocol allows most types of 
network traffic to be sent through the encrypted tunnel, creating a sort of virtual 
private network (VPN) between the local and remote systems. 

Perhaps the most common use of this feature is to allow X Window 
system traffic to be transmitted. On a system running an X server (that is, a 
machine displaying a GUI), it is possible to launch and run an X client program 
(a graphical application) on a remote system and have its display appear on 
the local system. It’s easy to do; here’s an example. Let’s say we are sitting at 
a Linux system called linuxbox that is running an X server and we want to run 
the xload program on a remote system named remote-sys to see the program's 
graphical output on our local system. We could do this: 


[me@linuxbox ~]$ ssh -X remote-sys 
me@remote-sys's password: 

Last login: Mon Sep 10 13:23:11 2018 
[me@remote-sys ~]$ xload 


After the xload command is executed on the remote system, its window
appears on the local system. On some systems, you may need to use the -Y
option rather than the -X option to do this.


scp and sftp


The OpenSSH package also includes two programs that can make use of
an SSH-encrypted tunnel to copy files across the network. The first, scp
(secure copy), is used much like the familiar cp program to copy files. The 
most notable difference is that the source or destination pathnames may 
be preceded with the name of a remote host, followed by a colon character.
For example, if we wanted to copy a document named document.txt
from our home directory on the remote system, remote-sys, to the current 
working directory on our local system, we could do this: 


[me@linuxbox ~]$ scp remote-sys:document.txt .
me@remote-sys's password:
document.txt                          100% 5581     5.5KB/s   00:00
[me@linuxbox ~]$


As with ssh, you may apply a username to the beginning of the remote 
host’s name if the desired remote host account name does not match that of 
the local system. 


[me@linuxbox ~]$ scp bob@remote-sys:document.txt . 


The second SSH file-copying program is sftp, which, as its name 
implies, is a secure replacement for the ftp program. sftp works much 
like the original ftp program that we used earlier; however, instead of 
transmitting everything in cleartext, it uses an SSH encrypted tunnel. 
sftp has an important advantage over conventional ftp in that it does not 
require an FTP server to be running on the remote host. It requires only 
the SSH server. This means that any remote machine that can connect 
with the SSH client can also be used as an FTP-like server. Here is a 
sample session. 


[me@linuxbox ~]$ sftp remote-sys
Connecting to remote-sys...
me@remote-sys's password:
sftp> ls
ubuntu-8.04-desktop-i386.iso
sftp> lcd Desktop
sftp> get ubuntu-8.04-desktop-i386.iso
Fetching /home/me/ubuntu-8.04-desktop-i386.iso to ubuntu-8.04-desktop-i386.iso
/home/me/ubuntu-8.04-desktop-i386.iso 100% 699MB 7.4MB/s 01:35
sftp> bye


The SFTP protocol is supported by many of the graphical file managers found in 
Linux distributions. Using either GNOME or KDE, we can enter a URI beginning 
with sftp:// into the location bar and operate on files stored on a remote system run- 
ning an SSH server. 


AN SSH CLIENT FOR WINDOWS? 


Let's say you are sitting at a Windows machine but you need to log in to your
Linux server and get some real work done; what do you do? Get an SSH client
program for your Windows box, of course! There are a number of these. The
most popular one is probably PuTTY by Simon Tatham and his team. The PuTTY
program displays a terminal window and allows a Windows user to open an
SSH (or telnet) session on a remote host. The program also provides analogs
for the scp and sftp programs.

PuTTY is available at www.chiark.greenend.org.uk/~sgtatham/putty/. 


Summing Up 


In this chapter, we surveyed the field of networking tools found on most 
Linux systems. Since Linux is so widely used in servers and networking 
appliances, there are many more that can be added by installing additional 
software. But even with the basic set of tools, it is possible to perform many 
useful network-related tasks.


SEARCHING FOR FILES


As we have wandered around our Linux system, one thing has become abundantly
clear: a typical Linux system has a lot of files! This raises the question,
“How do we find things?” We already know that the Linux file system is well
organized according to conventions passed down from one generation of
Unix-like systems to the next, but the sheer number of files can present a
daunting problem.

In this chapter, we will look at two tools that are used to find files on a 
system. 


locate Find files by name 


find Search for files in a directory hierarchy 


We will also look at a command that is often used with file-search com- 
mands to process the resulting list of files. 


xargs Build and execute command lines from standard input


In addition, we will introduce a couple of commands to assist us in our
explorations.


touch Change file times 
stat Display file or file system status 


locate—Find Files the Easy Way 


The locate program performs a rapid database search of pathnames and 
then outputs every name that matches a given substring. Say, for example, we 
want to find all the programs with names that begin with zip. Because we are 
looking for programs, we can assume that the name of the directory containing
the programs would end with bin/. Therefore, we could try to use locate
this way to find our files: 


[me@linuxbox ~]$ locate bin/zip 


locate will search its database of pathnames and output any that con- 
tain the string bin/zip. 


/usr/bin/zip
/usr/bin/zipcloak
/usr/bin/zipgrep
/usr/bin/zipinfo
/usr/bin/zipnote
/usr/bin/zipsplit


If the search requirement is not so simple, we can combine locate with 
other tools such as grep to design more interesting searches. 


[me@linuxbox ~]$ locate zip | grep bin 
/bin/bunzip2 
/bin/bzip2 
/bin/bzip2recover 
/bin/gunzip 
/bin/gzip 
/usr/bin/funzip 
/usr/bin/gpg-zip 
/usr/bin/preunzip 
/usr/bin/prezip 
/usr/bin/prezip-bin 
/usr/bin/unzip 
/usr/bin/unzipsfx 
/usr/bin/zip 
/usr/bin/zipcloak 
/usr/bin/zipgrep 
/usr/bin/zipinfo 
/usr/bin/zipnote 
/usr/bin/zipsplit 


The locate program has been around for a number of years, and there 
are several variants in common use. The two most common ones found in 
modern Linux distributions are slocate and mlocate, though they are usually 
accessed by a symbolic link named locate. The different versions of locate 
have overlapping option sets. Some versions include regular expression
matching (which we’ll cover in Chapter 19) and wildcard support. Check 
the man page for locate to determine which version of locate is installed. 


WHERE DOES THE LOCATE DATABASE COME FROM? 


You might notice that, on some distributions, locate fails to work just after the 
system is installed, but if you try again the next day, it works fine. What gives? 
The locate database is created by another program named updatedb. Usually, 
it is run periodically as a cron job, that is, a task performed at regular intervals 
by the cron daemon. Most systems equipped with locate run updatedb once
a day. Because the database is not updated continuously, you will notice that
very recent files do not show up when using locate. To overcome this, it’s
possible to run the updatedb program manually by becoming the superuser and
running updatedb at the prompt.


find—Find Files the Hard Way 


While the locate program can find a file based solely on its name, the find 
program searches a given directory (and its subdirectories) for files based 
on a variety of attributes. We’re going to spend a lot of time with find 
because it has a lot of interesting features that we will see again and again 
when we start to cover programming concepts in later chapters. 

In its simplest use, find is given one or more names of directories to 
search. For example, to produce a listing of our home directory, we can 
use this: 


[me@linuxbox ~]$ find ~ 


On most active user accounts, this will produce a large list. Because 
the list is sent to standard output, we can pipe the list into other programs. 
Let’s use wc to count the number of files. 


[me@linuxbox ~]$ find ~ | wc -l
47068 


Wow, we’ve been busy! The beauty of find is that it can be used to 
identify files that meet specific criteria. It does this through the (slightly 
strange) application of options, tests, and actions. We'll look at the tests first.


Tests


Let’s say we want a list of directories from our search. To do this, we could 
add the following test: 


[me@linuxbox ~]$ find ~ -type d | wc -l
1695 


Adding the test -type d limited the search to directories. Conversely, we 
could have limited the search to regular files with this test: 


[me@linuxbox ~]$ find ~ -type f | wc -l
38737 


Table 17-1 lists the common file type tests supported by find. 


Table 17-1: find File Types


File Type   Description
b           Block special device file
c           Character special device file
d           Directory
f           Regular file
l           Symbolic link


We can also search by file size and filename by adding some additional 
tests. Let’s look for all the regular files that match the wildcard pattern 
*.JPG and are larger than one megabyte. 


[me@linuxbox ~]$ find ~ -type f -name "*.JPG" -size +1M | wc -l
840 


In this example, we add the -name test followed by the wildcard pattern. 
Notice how we enclose it in quotes to prevent pathname expansion by the 
shell. Next, we add the -size test followed by the string +1M. The leading plus 
sign indicates that we are looking for files larger than the specified number. 
A leading minus sign would change the meaning of the string to be smaller 
than the specified number. Using no sign means “match the value exactly.” 
The trailing letter M indicates that the unit of measurement is megabytes. 
Table 17-2 lists the characters that can be used to specify units. 


Table 17-2: find Size Units


Character   Unit
b           512-byte blocks. This is the default if no unit is specified.
c           Bytes.
w           2-byte words.
k           Kilobytes (units of 1,024 bytes).
M           Megabytes (units of 1,048,576 bytes).
G           Gigabytes (units of 1,073,741,824 bytes).
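
A quick way to get a feel for the size units and the +/- notation is to try them on files of known size. The following is a minimal sketch, assuming a find that supports the units in Table 17-2 (GNU find does) and a scratch directory created with mktemp; the filenames are made up for illustration.

```shell
#!/bin/sh
# Build a scratch directory holding three files of known size.
workdir=$(mktemp -d)
head -c 100     /dev/zero > "$workdir/small.dat"    # 100 bytes
head -c 5000    /dev/zero > "$workdir/medium.dat"   # 5,000 bytes
head -c 2097152 /dev/zero > "$workdir/large.dat"    # 2 megabytes

# Leading plus sign: larger than one kilobyte.
over_1k=$(cd "$workdir" && find . -type f -size +1k | LC_ALL=C sort)

# No sign: exactly 100 bytes (c means bytes).
exact_100c=$(cd "$workdir" && find . -type f -size 100c)

# Leading minus sign: smaller than 200 bytes.
under_200c=$(cd "$workdir" && find . -type f -size -200c)

# Larger than one megabyte.
over_1m=$(cd "$workdir" && find . -type f -size +1M)

printf 'over 1k:\n%s\n' "$over_1k"
printf 'exactly 100c: %s\n' "$exact_100c"
printf 'under 200c:   %s\n' "$under_200c"
printf 'over 1M:      %s\n' "$over_1m"

rm -rf "$workdir"
```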


find supports a large number of tests. Table 17-3 provides a rundown 
of the common ones. Note that in cases where a numeric argument is 
required, the same + and - notation discussed previously can be applied. 


Table 17-3: find Tests


Test              Description
-cmin n           Match files or directories whose contents or attributes were
                  last modified exactly n minutes ago. To specify less than n
                  minutes ago, use -n, and to specify more than n minutes ago,
                  use +n.
-cnewer file      Match files or directories whose contents or attributes were
                  last modified more recently than those of file.
-ctime n          Match files or directories whose contents or attributes were
                  last modified n*24 hours ago.
-empty            Match empty files and directories.
-group name       Match files or directories belonging to group name. name may
                  be expressed either as a group name or as a numeric group ID.
-iname pattern    Like the -name test but case-insensitive.
-inum n           Match files with inode number n. This is helpful for finding
                  all the hard links to a particular inode.
-mmin n           Match files or directories whose contents were last modified
                  n minutes ago.
-mtime n          Match files or directories whose contents were last modified
                  n*24 hours ago.
-name pattern     Match files and directories with the specified wildcard
                  pattern.
-newer file       Match files and directories whose contents were modified more
                  recently than the specified file. This is useful when writing
                  shell scripts that perform file backups. Each time you make a
                  backup, update a file (such as a log) and then use find to
                  determine which files have changed since the last update.
-nouser           Match files and directories that do not belong to a valid
                  user. This can be used to find files belonging to deleted
                  accounts or to detect activity by attackers.
-nogroup          Match files and directories that do not belong to a valid
                  group.
-perm mode        Match files or directories that have permissions set to the
                  specified mode. mode can be expressed by either octal or
                  symbolic notation.
-samefile name    Similar to the -inum test. Match files that share the same
                  inode number as file name.
-size n           Match files of size n.
-type c           Match files of type c.
-user name        Match files or directories belonging to user name. The user
                  may be expressed by a username or by a numeric user ID.


This is not a complete list. The find man page has all the details. 
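
To make a few of these tests concrete, here is a minimal sketch that exercises -empty, -iname, and -mmin on a scratch directory. The directory layout and filenames are invented for the example, and it assumes a find that supports these tests (GNU find does).

```shell
#!/bin/sh
# Scratch directory: one empty directory, one empty file, two files with content.
workdir=$(mktemp -d)
mkdir "$workdir/emptydir"
: > "$workdir/empty.txt"
echo data > "$workdir/Photo.JPG"
echo data > "$workdir/notes.txt"

# -empty matches both empty files and empty directories.
empties=$(cd "$workdir" && find . -empty | LC_ALL=C sort)

# -iname ignores case, so the pattern *.jpg matches Photo.JPG.
jpgs=$(cd "$workdir" && find . -type f -iname '*.jpg')

# -mmin -5 matches files whose contents changed less than 5 minutes ago;
# all three regular files were just created, so all three match.
recent=$(cd "$workdir" && find . -type f -mmin -5 | wc -l)

printf 'empty items:\n%s\n' "$empties"
printf 'jpg files: %s\n' "$jpgs"
printf 'recently modified files: %s\n' "$recent"

rm -rf "$workdir"
```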


Operators 


Even with all the tests that find provides, we might still need a better way to 
describe the logical relationships between the tests. For example, what if we 
needed to determine whether all the files and subdirectories in a directory 
had secure permissions? We would look for all the files with permissions 
that are not 0600 and the directories with permissions that are not 0700. 
Fortunately, find provides a way to combine tests using logical operators to 
create more complex logical relationships. To express the aforementioned 
test, we could do this: 


[me@linuxbox ~]$ find ~ \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \) 


Yikes! That sure looks weird. What is all this stuff? Actually, the operators 
are not that complicated once you get to know them. Table 17-4 describes the 
logical operators used with find. 


Table 17-4: find Logical Operators


Operator   Description
-and       Match if the tests on both sides of the operator are true. This can
           be shortened to -a. Note that when no operator is present, -and is
           implied by default.
-or        Match if a test on either side of the operator is true. This can be
           shortened to -o.
-not       Match if the test following the operator is false. This can be
           abbreviated with an exclamation point (!).
( )        Group tests and operators together to form larger expressions. This
           is used to control the precedence of the logical evaluations. By
           default, find evaluates from left to right. It is often necessary
           to override the default evaluation order to obtain the desired
           result. Even if not needed, it is helpful sometimes to include the
           grouping characters to improve the readability of the command. Note
           that since the parentheses have special meaning to the shell, they
           must be quoted when using them on the command line to allow them to
           be passed as arguments to find. Usually the backslash character is
           used to escape them.


With this list of operators in hand, let’s deconstruct our find command. 
When viewed from the uppermost level, we see that our tests are arranged 
as two groupings separated by an -or operator. 


( expression 1 ) -or ( expression 2 ) 


This makes sense because we are searching for files with a certain 
set of permissions and for directories with a different set. If we are look- 
ing for both files and directories, why do we use -or instead of -and? As 
find scans through the files and directories, each one is evaluated to see 
whether it matches the specified tests. We want to know whether it is either 
a file with bad permissions or a directory with bad permissions. It can’t be
both at the same time. So if we expand the grouped expressions, we can 
see it this way: 


( file with bad perms ) -or ( directory with bad perms ) 


Our next challenge is how to test for “bad permissions.” How do we 
do that? Actually, we don’t. What we will test for is “not good permissions” 
because we know what “good permissions” are. In the case of files, we define 
good as 0600, and for directories, we define it as 0700. The expression that 
will test files for “not good” permissions is as follows: 


-type f -and -not -perm 0600


For directories it is as follows: 


-type d -and -not -perm 0700


As noted in Table 17-4, the -and operator can be safely removed because 
it is implied by default. So if we put this all back together, we get our final 
command. 


find ~ ( -type f -not -perm 0600 ) -or ( -type d -not -perm 0700 )


However, because the parentheses have special meaning to the shell, 
we must escape them to prevent the shell from trying to interpret them. 
Preceding each one with a backslash character does the trick. 
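
The finished command can be tried safely on a small scratch tree before pointing it at a home directory. The following is a minimal sketch; the directory and filenames are invented for the example, and the permissions are set explicitly so the result does not depend on the umask.

```shell
#!/bin/sh
# Scratch tree with one "good" and one "bad" item of each kind.
workdir=$(mktemp -d)
mkdir "$workdir/good_dir" "$workdir/bad_dir"
touch "$workdir/good_file" "$workdir/bad_file"

chmod 0700 "$workdir" "$workdir/good_dir"   # good directory permissions
chmod 0755 "$workdir/bad_dir"               # not 0700, so "bad"
chmod 0600 "$workdir/good_file"             # good file permissions
chmod 0644 "$workdir/bad_file"              # not 0600, so "bad"

# The parentheses are escaped so the shell passes them through to find.
bad=$(cd "$workdir" && \
      find . \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \) \
      | LC_ALL=C sort)

printf 'items without good permissions:\n%s\n' "$bad"

rm -rf "$workdir"
```

Only the two items whose permissions differ from our definition of good are listed.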

There is another feature of logical operators that is important to 
understand. Let’s say that we have two expressions separated by a logical 
operator. 


expr1 -operator expr2 


In all cases, expr1 will be performed; however, the operator will
determine whether expr2 is performed. Table 17-5 outlines how it works.


Table 17-5: find AND/OR Logic


Result of expr1   Operator   expr2 is...
True              -and       Always performed
False             -and       Never performed
True              -or        Never performed
False             -or        Always performed


Why does this happen? It’s done to improve performance. Take -and, for 
example. We know that the expression expr1 -and expr2 cannot be true if the 
result of expr1 is false, so there is no point in performing expr2. Likewise, if 
we have the expression expr1 -or expr2 and the result of expr1 is true, there 
is no point in performing expr2, as we already know that the expression 
expr1 -or expr2 is true. 

OK, so it helps it go faster. Why is this important? It’s important because 
we can rely on this behavior to control how actions are performed, as we will 
soon see.


Predefined Actions 


Let’s get some work done! Having a list of results from our find command
is useful, but what we really want to do is act on the items on the list.
Fortunately, find allows actions to be performed based on the search results.
There is a set of predefined actions and several ways to apply user-defined
actions. First, let’s look at a few of the predefined actions listed in Table 17-6.


Table 17-6: Predefined find Actions


Action    Description
-delete   Delete the currently matching file.
-ls       Perform the equivalent of ls -dils on the matching file. Output is
          sent to standard output.
-print    Output the full pathname of the matching file to standard output.
          This is the default action if no other action is specified.
-quit     Quit once a match has been made.


As with the tests, there are many more actions. See the find man page 
for full details.

In the first example, we did this:


find ~ 


This produced a list of every file and subdirectory contained within our
home directory. It produced a list because the -print action is implied if
no other action is specified. Thus, our command could also be expressed
as follows:


find ~ -print 


We can use find to delete files that meet certain criteria. For example, 
to delete files that have the file extension .bak (which is often used to desig- 
nate backup files), we could use this command: 


find ~ -type f -name '*.bak' -delete 


In this example, every file in the user’s home directory (and its subdi- 
rectories) is searched for filenames ending in .bak. When they are found, 
they are deleted. 


It should go without saying that you should use extreme caution when using 
the -delete action. Always test the command first by substituting the -print action 
for -delete to confirm the search results. 
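
The preview-then-delete habit can be sketched as follows. This is a minimal example on a scratch directory with invented filenames; it assumes a find that provides the -delete action (GNU find does).

```shell
#!/bin/sh
# Scratch directory with two backup files and one file we want to keep.
workdir=$(mktemp -d)
touch "$workdir/report.txt" "$workdir/report.bak" "$workdir/old.bak"

# Step 1: preview with -print and review the list.
preview=$(cd "$workdir" && find . -type f -name '*.bak' -print | LC_ALL=C sort)
printf 'would delete:\n%s\n' "$preview"

# Step 2: once the list looks right, swap -print for -delete.
(cd "$workdir" && find . -type f -name '*.bak' -delete)

# Confirm what is left.
remaining=$(cd "$workdir" && find . -type f | LC_ALL=C sort)
printf 'remaining:\n%s\n' "$remaining"

rm -rf "$workdir"
```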


Before we go on, let’s take another look at how the logical operators 
affect actions. Consider the following command: 


find ~ -type f -name '*.bak' -print 


As we have seen, this command will look for every regular file (-type f)
whose name ends with .bak (-name '*.bak') and will output the pathname
of each matching file to standard output (-print). However, the reason
the command performs the way it does is determined by the logical
relationships between each of the tests and actions. Remember, there is, by
default, an implied -and relationship between each test and action. We could
also express the command this way to make the logical relationships easier to
see:


find ~ -type f -and -name '*.bak' -and -print 


With our command fully expressed, let’s look at how the logical opera- 
tors affect its execution: 


Test/Action     Is performed only if...
-print          -type f and -name '*.bak' are true
-name '*.bak'   -type f is true
-type f         Is always performed, since it is the first test/action in an
                -and relationship.


Because the logical relationship between the tests and actions determines
which of them are performed, we can see that the order of the tests
and actions is important. For instance, if we were to reorder the tests and
actions so that the -print action was the first one, the command would
behave much differently.


find ~ -print -and -type f -and -name '*.bak'


This version of the command will print each file (the -print action always 
evaluates to true) and then test for file type and the specified file extension. 
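
The effect of ordering, and of the short-circuit behavior outlined in Table 17-5, can be observed on a small scratch tree. This is a minimal sketch with invented filenames; the third command is an extra illustration that is not taken from the text above.

```shell
#!/bin/sh
# Scratch tree: one ordinary file and one .bak file in a subdirectory.
workdir=$(mktemp -d)
mkdir "$workdir/sub"
touch "$workdir/keep.txt" "$workdir/sub/old.bak"

# Tests first: -print is reached only when both tests are true.
tests_first=$(cd "$workdir" && \
    find . -type f -and -name '*.bak' -and -print | LC_ALL=C sort)

# -print first: it is always performed, so every item is printed
# before the tests are even considered.
print_first=$(cd "$workdir" && \
    find . -print -and -type f -and -name '*.bak' | LC_ALL=C sort)

# With -or, -print is performed only when -type d is false,
# so directories are skipped and everything else is printed.
non_dirs=$(cd "$workdir" && find . -type d -or -print | LC_ALL=C sort)

printf 'tests first:\n%s\n' "$tests_first"
printf 'print first:\n%s\n' "$print_first"
printf 'non-directories:\n%s\n' "$non_dirs"

rm -rf "$workdir"
```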


User-Defined Actions 


In addition to the predefined actions, we can also invoke arbitrary com- 
mands. The traditional way of doing this is with the -exec action. This 
action works like this: 


-exec command {} ; 


Here, command is the name of a command, {} is a symbolic representation 
of the current pathname, and the semicolon is a required delimiter indicat- 
ing the end of the command. Here’s an example of using -exec to act like 
the -delete action discussed earlier: 


-exec rm '{}' ';'


Again, because the brace and semicolon characters have special mean- 
ing to the shell, they must be quoted or escaped. 

It’s also possible to execute a user-defined action interactively. By using 
the -ok action in place of -exec, the user is prompted before execution of 
each specified command. 


find ~ -type f -name 'foo*' -ok ls -l '{}' ';'
< ls ... /home/me/bin/foo > ? y
-rwxr-xr-x 1 me me 224 2007-10-29 18:44 /home/me/bin/foo
< ls ... /home/me/foo.txt > ? y
-rw-r--r-- 1 me me   0 2016-09-19 12:53 /home/me/foo.txt


In this example, we search for files with names starting with the string 
foo and execute the command ls -l each time one is found. Using the -ok
action prompts the user before the ls command is executed.


Improving Efficiency


When the -exec action is used, it launches a new instance of the specified
command each time a matching file is found. There are times when we might
prefer to combine all of the search results and launch a single instance of the
command. For example, rather than executing the commands like this:


ls -l file1
ls -l file2


we may prefer to execute them this way: 


ls -l file1 file2


This causes the command to be executed only one time rather than 
multiple times. There are two ways we can do this: the traditional way, using 
the external command xargs, and the alternate way, using a new feature in 
find itself. We'll talk about the alternate way first. 

By changing the trailing semicolon character to a plus sign, we activate 
the capability of find to combine the results of the search into an argument 
list for a single execution of the desired command. Returning to our example, 
this will execute ls each time a matching file is found:


find ~ -type f -name 'foo*' -exec ls -l '{}' ';'
-rwxr-xr-x 1 me me 224 2007-10-29 18:44 /home/me/bin/foo
-rw-r--r-- 1 me me   0 2016-09-19 12:53 /home/me/foo.txt


By changing the command to the following: 


find ~ -type f -name 'foo*' -exec ls -l '{}' +
-rwxr-xr-x 1 me me 224 2007-10-29 18:44 /home/me/bin/foo
-rw-r--r-- 1 me me   0 2016-09-19 12:53 /home/me/foo.txt


we get the same results, but the system has to execute the ls command
only once. 
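
Because ls produces the same listing either way, the difference is easier to see with a command that reports how many filenames it was handed. The following is a minimal sketch; using sh -c as the executed command is our own device for counting arguments, and the filenames are invented for the example.

```shell
#!/bin/sh
# Scratch directory with three matching files.
workdir=$(mktemp -d)
touch "$workdir/foo1" "$workdir/foo2" "$workdir/foo3"

# The word "sh" after the script text fills $0, so the filenames
# supplied by find become $1, $2, ... and $# counts them.

# Semicolon form: one invocation per matching file.
per_file=$(find "$workdir" -type f -name 'foo*' \
    -exec sh -c 'echo "invoked with $# file(s)"' sh '{}' ';')

# Plus form: the matches are combined into a single argument list.
batched=$(find "$workdir" -type f -name 'foo*' \
    -exec sh -c 'echo "invoked with $# file(s)"' sh '{}' +)

printf 'with ;\n%s\n' "$per_file"
printf 'with +\n%s\n' "$batched"

rm -rf "$workdir"
```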


xargs 


The xargs command performs an interesting function. It accepts input from 
standard input and converts it into an argument list for a specified command. 
With our example, we would use it like this: 


find ~ -type f -name 'foo*' -print | xargs ls -l
-rwxr-xr-x 1 me me 224 2007-10-29 18:44 /home/me/bin/foo
-rw-r--r-- 1 me me   0 2016-09-19 12:53 /home/me/foo.txt


Here we see the output of the find command piped into xargs, which, 
in turn, constructs an argument list for the ls command and then
executes it.


While the number of arguments that can be placed into a command line is quite 
large, it’s not unlimited. It is possible to create commands that are too long for the 
shell to accept. When a command line exceeds the maximum length supported by 
the system, xargs executes the specified command with the maximum number of 
arguments possible and then repeats this process until standard input is exhausted. 
To see the maximum size of the command line, execute xargs with the --show-limits 
option. 
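
The repeat-until-exhausted behavior is hard to trigger by hand because the real limit is so large, but it can be imitated at small scale. The following is a minimal sketch that uses the standard -n option of xargs (not mentioned in the text above) to cap each invocation at two arguments; the input words are arbitrary.

```shell
#!/bin/sh
# Five items arrive on standard input, but each echo may take at most two,
# so xargs runs echo three times: two full batches and one remainder.
batches=$(printf '%s\n' one two three four five | xargs -n 2 echo)

printf '%s\n' "$batches"
```

Each output line corresponds to one execution of echo, just as each batch would correspond to one execution of the specified command when the system limit is reached.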


DEALING WITH FUNNY FILENAMES


Unix-like systems allow embedded spaces (and even newlines!) in filenames.
This causes problems for programs like xargs that construct argument lists for
other programs. An embedded space will be treated as a delimiter, and the
resulting command will interpret each space-separated word as a separate
argument. To overcome this, find and xargs allow the optional use of a null
character as an argument separator. A null character is defined in ASCII as the
character represented by the number zero (as opposed to, for example, the
space character, which is defined in ASCII as the character represented by the
number 32). The find command provides the action -print0, which produces
null-separated output, and the xargs command has the --null (or -0) option,
which accepts null-separated input. Here’s an example:


find ~ -iname '*.jpg' -print0 | xargs --null ls -l


Using this technique, we can ensure that all files, even those containing 
embedded spaces in their names, are handled correctly. 
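
To watch the problem and the fix side by side, here is a minimal sketch on a scratch directory with one ordinary filename and one containing a space. The filenames are invented, and sh -c is used only as a convenient way to count the arguments that xargs passes along.

```shell
#!/bin/sh
# Two files, one of which has an embedded space in its name.
workdir=$(mktemp -d)
touch "$workdir/plain.jpg" "$workdir/two words.jpg"

# Newline-separated output: xargs splits "two words.jpg" at the space,
# so the command receives three arguments for two files.
naive=$(cd "$workdir" && \
    find . -iname '*.jpg' -print | xargs sh -c 'echo $#' sh)

# Null-separated output: each filename arrives as a single argument.
safe=$(cd "$workdir" && \
    find . -iname '*.jpg' -print0 | xargs -0 sh -c 'echo $#' sh)

printf 'arguments seen with -print:  %s\n' "$naive"
printf 'arguments seen with -print0: %s\n' "$safe"

rm -rf "$workdir"
```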


A Return to the Playground 


It’s time to put find to some (almost) practical use. We’ll create a play- 
ground and try some of what we have learned. 
First, let’s create a playground with lots of subdirectories and files. 


[me@linuxbox ~]$ mkdir -p playground/dir-{001..100} 
[me@linuxbox ~]$ touch playground/dir-{001..100}/file-{A..Z}


Marvel at the power of the command line! With these two lines, we cre- 
ated a playground directory containing 100 subdirectories each containing 
26 empty files. Try that with the GUI! 

The method we employed to accomplish this magic involved a familiar 
command (mkdir), an exotic shell expansion (braces), and a new command, 
touch. By combining mkdir with the -p option (which causes mkdir to create 
the parent directories of the specified paths) with brace expansion, we were 
able to create 100 subdirectories. 

The touch command is usually used to set or update the access, change, 
and modify times of files. However, if a filename argument is that of a non- 
existent file, an empty file is created. 

In our playground, we created 100 instances of file-A. Let’s find them. 


[me@linuxbox ~]$ find playground -type f -name 'file-A' 


Note that unlike ls, find does not produce results in sorted order. Its
order is determined by the layout of the storage device. We can confirm 
that we actually have 100 instances of the file this way. 


[me@linuxbox ~]$ find playground -type f -name 'file-A' | wc -l
100 


Next, let’s look at finding files based on their modification times. This 
will be helpful when creating backups or organizing files in chronological 
order. To do this, we will first create a reference file against which we will 
compare modification times. 


[me@linuxbox ~]$ touch playground/timestamp 


This creates an empty file named timestamp and sets its modification 
time to the current time. We can verify this by using another handy command,
stat, which is a kind of souped-up version of ls. The stat command
reveals all that the system understands about a file and its attributes. 


[me@linuxbox ~]$ stat playground/timestamp
  File: `playground/timestamp'
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: 803h/2051d Inode: 14265061    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1001/      me)   Gid: ( 1001/      me)
Access: 2018-10-08 15:15:39.000000000 -0400
Modify: 2018-10-08 15:15:39.000000000 -0400
Change: 2018-10-08 15:15:39.000000000 -0400


If we use touch again and then examine the file with stat, we will see 
that the file’s times have been updated. 


[me@linuxbox ~]$ touch playground/timestamp
[me@linuxbox ~]$ stat playground/timestamp
  File: `playground/timestamp'
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: 803h/2051d Inode: 14265061    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1001/      me)   Gid: ( 1001/      me)
Access: 2018-10-08 15:23:33.000000000 -0400
Modify: 2018-10-08 15:23:33.000000000 -0400
Change: 2018-10-08 15:23:33.000000000 -0400


Next, let’s use find to update some of our playground files. 


[me@linuxbox ~]$ find playground -type f -name 'file-B' -exec touch '{}' ';'


This updates all playground files named file-B. We'll use find to identify
the updated files by comparing all the files to the reference file timestamp. 


[me@linuxbox ~]$ find playground -type f -newer playground/timestamp 


The results contain all 100 instances of file-B. Since we performed 
a touch on all the files in the playground named file-B after we updated
timestamp, they are now “newer” than timestamp and thus can be identified 
with the -newer test. 
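
The same comparison can be reproduced deterministically by setting the modification times explicitly instead of relying on the clock. The following is a minimal sketch; it uses the -t option of touch (not covered in the text above) to assign fixed timestamps, and the dates and filenames are arbitrary.

```shell
#!/bin/sh
# Scratch directory with a reference file and two candidates.
workdir=$(mktemp -d)

# file-A and timestamp share the same modification time (2018-01-01 00:00).
touch -t 201801010000 "$workdir/file-A" "$workdir/timestamp"

# file-B is modified one day later than the reference file.
touch -t 201801020000 "$workdir/file-B"

# Only files modified more recently than timestamp are matched.
newer=$(cd "$workdir" && find . -type f -newer timestamp)

printf 'newer than timestamp: %s\n' "$newer"

rm -rf "$workdir"
```

A file with exactly the same modification time as the reference file is not considered newer, so only file-B is reported.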


Finally, let’s go back to the bad permissions test we performed earlier
and apply it to playground.


[me@linuxbox ~]$ find playground \( -type f -not -perm 0600 \) -or \( -type d 
-not -perm 0700 \) 


This command lists all 100 directories and 2,600 files in playground (as 
well as timestamp and playground itself, for a total of 2,702) because none of
them meets our definition of “good permissions.” With our knowledge of 
operators and actions, we can add actions to this command to apply new 
permissions to the files and directories in our playground. 


[me@linuxbox ~]$ find playground \( -type f -not -perm 0600 -exec chmod 0600
'{}' ';' \) -or \( -type d -not -perm 0700 -exec chmod 0700 '{}' ';' \)


On a day-to-day basis, we might find it easier to issue two commands, 
one for the directories and one for the files, rather than this one large 
compound command, but it’s nice to know that we can do it this way. The 
important point here is to understand how the operators and actions can 
be used together to perform useful tasks. 
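
For comparison, the two-command approach might look like the following. This is a minimal sketch on a much smaller stand-in for the playground (two directories and two files, with names chosen for the example); the starting permissions are set explicitly so the result does not depend on the umask.

```shell
#!/bin/sh
# Small stand-in for the playground with deliberately "bad" permissions.
workdir=$(mktemp -d)
mkdir -p "$workdir/playground/dir-1" "$workdir/playground/dir-2"
touch "$workdir/playground/dir-1/file-A" "$workdir/playground/dir-2/file-A"
chmod 0755 "$workdir/playground" "$workdir/playground/dir-1" "$workdir/playground/dir-2"
chmod 0644 "$workdir/playground/dir-1/file-A" "$workdir/playground/dir-2/file-A"

# Count the items that fail our definition of good permissions.
before=$(cd "$workdir" && \
    find playground \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \) | wc -l)

# One command for the directories...
(cd "$workdir" && find playground -type d -not -perm 0700 -exec chmod 0700 '{}' ';')
# ...and one for the files.
(cd "$workdir" && find playground -type f -not -perm 0600 -exec chmod 0600 '{}' ';')

# Run the original test again; nothing should be left to report.
after=$(cd "$workdir" && \
    find playground \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \) | wc -l)

printf 'bad items before: %s\n' "$before"
printf 'bad items after:  %s\n' "$after"

rm -rf "$workdir"
```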


find Options 


Finally, we have the options, which are used to control the scope of a find 
search. They may be included with other tests and actions when construct- 
ing find expressions. Table 17-7 lists the most commonly used find options. 


Table 17-7: Commonly Used find Options 


Option            Description 
-depth            Direct find to process a directory’s files before the 
                  directory itself. This option is automatically applied 
                  when the -delete action is specified. 
-maxdepth levels  Set the maximum number of levels that find will descend 
                  into a directory tree when performing tests and actions. 
-mindepth levels  Set the minimum number of levels that find will descend 
                  into a directory tree before applying tests and actions. 
-mount            Direct find not to traverse directories that are mounted 
                  on other file systems. 
-noleaf           Direct find not to optimize its search based on the 
                  assumption that it is searching a Unix-like file system. 
                  This is needed when scanning DOS/Windows file systems 
                  and CD-ROMs. 
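
To see how the depth options limit a search, here is a minimal sketch. The three-level tree and its names (top, middle, bottom) are made up for the demonstration and live in a scratch directory.

```shell
# Throwaway tree three levels deep (names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/top/middle/bottom"
touch "$work/top/a" "$work/top/middle/b" "$work/top/middle/bottom/c"

# Descend at most one level below the starting point.
shallow=$(cd "$work" && find top -maxdepth 1 | sort)
# Skip the starting point and its immediate children.
deep=$(cd "$work" && find top -mindepth 2 | sort)

echo "$shallow"
echo "---"
echo "$deep"
```

The first listing stops at top/middle, while the second begins below it. Placing these options before any tests keeps GNU find from warning about their position.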


Summing Up 


It’s easy to see that locate is as simple as find is complicated. They both 
have their uses. Take the time to explore the many features of find. It 
can, with regular use, improve your understanding of Linux file system 
operations. 


ARCHIVING AND BACKUP 


One of the primary tasks of a computer system’s administrator is keeping 
the system’s data secure. One way this is done is by performing timely 
backups of the system’s files. Even if you're not a system administrator, it 
is often useful to make copies of things and move large collections of files 
from place to place and from device to device. 


In this chapter, we will look at several common programs that are used 
to manage collections of files. These are the file compression programs: 


gzip Compress or expand files 


bzip2 A block sorting file compressor 

These are the archiving programs: 


tar Tape archiving utility 


zip Package and compress files 

This is the file synchronization program: 


rsync Remote file and directory synchronization 


Compressing Files 


Throughout the history of computing, there has been a struggle to get the 
most data into the smallest available space, whether that space be memory, 
storage devices, or network bandwidth. Many of the data services that we 
take for granted today, such as mobile phone service, high-definition televi- 
sion, or broadband Internet, owe their existence to effective data compression 
techniques. 

Data compression is the process of removing redundancy from data. 
Let’s consider an imaginary example. Suppose we had an entirely black 
picture file with the dimensions of 100 pixels by 100 pixels. In terms of 
data storage (assuming 24 bits, or 3 bytes per pixel), the image will occupy 
30,000 bytes of storage. 


100 x 100 x 3 = 30,000 


An image that is all one color contains entirely redundant data. If we 
were clever, we could encode the data in such a way that we simply describe 
the fact that we have a block of 10,000 black pixels. So, instead of storing 
a block of data containing 30,000 zeros (black is usually represented in 
image files as zero), we could compress the data into the number 10,000, 
followed by a zero to represent our data. Such a data compression scheme is 
called run-length encoding and is one of the most rudimentary compression 
techniques. Today’s techniques are much more advanced and complex, but 
the basic goal remains the same: get rid of redundant data. Compression 
algorithms (the mathematical techniques used to carry out the compression) 
fall into two general categories. 


• Lossless. Lossless compression preserves all the data contained in the 
  original. This means that when a file is restored from a compressed 
  version, the restored file is exactly the same as the original, 
  uncompressed version. 

• Lossy. Lossy compression, on the other hand, removes data as the 
  compression is performed to allow more compression to be applied. When 
  a lossy file is restored, it does not match the original version; rather, it 
  is a close approximation. Examples of lossy compression are JPEG (for 
  images) and MP3 (for music). 


In our discussion, we will look exclusively at lossless compression since 
most data on computers cannot tolerate any data loss. 
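
The run-length encoding idea described earlier can be tried in miniature with standard tools. This is only a sketch under a simplifying assumption: each pixel is written as one line of text rather than as 3 bytes of image data, and uniq -c stands in for the encoder.

```shell
# Illustrative data: one pixel value per line, 10,000 black (0) pixels.
pixels=$(seq 10000 | sed 's/.*/0/')

# uniq -c collapses each run of identical adjacent lines into "count value",
# which is run-length encoding in its simplest form.
encoded=$(printf '%s\n' "$pixels" | uniq -c)

echo "raw lines:     $(printf '%s\n' "$pixels" | wc -l)"
echo "encoded lines: $(printf '%s\n' "$encoded" | wc -l)"
echo "encoded data:  $encoded"
```

Ten thousand identical lines shrink to a single count-and-value pair, and because nothing is discarded, the original could be rebuilt exactly. That makes this a lossless scheme.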


gzip 
The gzip program is used to compress one or more files. When executed, 
it replaces the original file with a compressed version of the original. The 
corresponding gunzip program is used to restore compressed files to their 
original, uncompressed form. Here is an example: 


[me@linuxbox ~]$ ls -l /etc > foo.txt 
[me@linuxbox ~]$ ls -l foo.* 
-rw-r--r-- 1 me me 15738 2018-10-14 07:15 foo.txt 
[me@linuxbox ~]$ gzip foo.txt 
[me@linuxbox ~]$ ls -l foo.* 
-rw-r--r-- 1 me me 3230 2018-10-14 07:15 foo.txt.gz 
[me@linuxbox ~]$ gunzip foo.txt 
[me@linuxbox ~]$ ls -l foo.* 
-rw-r--r-- 1 me me 15738 2018-10-14 07:15 foo.txt 


In this example, we create a text file named foo.txt from a directory listing. 
Next, we run gzip, which replaces the original file with a compressed version 
named foo.txt.gz. In the directory listing of foo.*, we see that the original 
file has been replaced with the compressed version and that the compressed 
version is about one-fifth the size of the original. We can also see that the 
compressed file has the same permissions and timestamp as the original. 

Next, we run the gunzip program to uncompress the file. Afterward, we 
can see that the compressed version of the file has been replaced with the 
original, again with the permissions and timestamp preserved. 

gzip has many options, as described in Table 18-1. 


Table 18-1: gzip Options 


Option   Long option    Description 
-c       --stdout       Write output to standard output and keep the original files. 
         --to-stdout 
-d       --decompress   Decompress. This causes gzip to act like gunzip. 
         --uncompress 
-f       --force        Force compression even if a compressed version of the 
                        original file already exists. 
-h       --help         Display usage information. 
-l       --list         List compression statistics for each file compressed. 
-r       --recursive    If one or more arguments on the command line is a 
                        directory, recursively compress files contained within them. 
-t       --test         Test the integrity of a compressed file. 
-v       --verbose      Display verbose messages while compressing. 
-number                 Set amount of compression. number is an integer in the 
                        range of 1 (fastest, least compression) to 9 (slowest, most 
                        compression). The values 1 and 9 may also be expressed 
                        as --fast and --best, respectively. The default value is 6. 


Let’s return to our earlier example. 
[me@linuxbox ~]$ gzip foo.txt 
[me@linuxbox ~]$ gzip -tv foo.txt.gz 
foo.txt.gz: OK 
[me@linuxbox ~]$ gzip -d foo.txt.gz 


Here, we replaced foo.txt with a compressed version named foo.txt.gz. 
Next, we tested the integrity of the compressed version, using the -t and 
-v options. Finally, we decompressed the file to its original form. 
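
Several of the options in Table 18-1 can be combined in one short experiment. The following is a sketch only; sample.txt, fast.gz, and best.gz are hypothetical names in a scratch directory, and the exact sizes reported will vary with the gzip version.

```shell
# Work in a scratch directory; sample.txt is an illustrative test file.
work=$(mktemp -d)
seq 1 20000 > "$work/sample.txt"

# -c writes to standard output, so sample.txt is left in place.
gzip -1 -c "$work/sample.txt" > "$work/fast.gz"
gzip -9 -c "$work/sample.txt" > "$work/best.gz"

# Compare the sizes and ask gzip for its own statistics.
ls -l "$work/sample.txt" "$work/fast.gz" "$work/best.gz"
gzip -l "$work/fast.gz" "$work/best.gz"

# Test the integrity of both compressed files.
gzip -tv "$work/fast.gz" "$work/best.gz"
```

Because of -c, the original survives alongside both compressed copies, which makes it easy to compare compression levels on the same input.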

gzip can also be used in interesting ways via standard input and output. 


[me@linuxbox ~]$ ls -l /etc | gzip > foo.txt.gz 


This command creates a compressed version of a directory listing. 

The gunzip program, which uncompresses gzip files, assumes that file- 
names end in the extension .gz, so it’s not necessary to specify it, as long as 
the specified name is not in conflict with an existing uncompressed file. 


[me@linuxbox ~]$ gunzip foo.txt 


If our goal were only to view the contents of a compressed text file, we 
could do this: 


[me@linuxbox ~]$ gunzip -c foo.txt | less 


Alternatively, there is a program supplied with gzip, called zcat, that is 
equivalent to gunzip with the -c option. It can be used like the cat command 
on gzip-compressed files. 


[me@linuxbox ~]$ zcat foo.txt.gz | less 


There is a zless program, too. It performs the same function as the previous pipeline. 


bzip2 

The bzip2 program, by Julian Seward, is similar to gzip but uses a different 
compression algorithm that achieves higher levels of compression at the 
cost of compression speed. In most regards, it works in the same fashion as 
gzip. A file compressed with bzip2 is denoted with the extension .bz2. 


[me@linuxbox ~]$ ls -l /etc > foo.txt 
[me@linuxbox ~]$ ls -l foo.txt 
-rw-r--r-- 1 me me 15738 2018-10-17 13:51 foo.txt 
[me@linuxbox ~]$ bzip2 foo.txt 
[me@linuxbox ~]$ ls -l foo.txt.bz2 
-rw-r--r-- 1 me me 2792 2018-10-17 13:51 foo.txt.bz2 
[me@linuxbox ~]$ bunzip2 foo.txt.bz2 


As we can see, bzip2 can be used the same way as gzip. All the options 
(except for -r) that we discussed for gzip are also supported in bzip2. Note, 
however, that the compression-level option (-number) has a somewhat different 
meaning to bzip2. bzip2 comes with bunzip2 and bzcat for decompressing files. 

bzip2 also comes with the bzip2recover program, which will try to recover 
damaged .bz2 files. 


DON’T BE COMPRESSIVE COMPULSIVE 


I occasionally see people attempting to compress a file that has already been 
compressed with an effective compression algorithm by doing something like this: 


$ gzip picture.jpg 


Don't do it. You’re probably just wasting time and space! If you apply 
compression to a file that is already compressed, you will usually end up with 
a larger file. This is because all compression techniques involve some overhead 
that is added to the file to describe the compression. If you try to compress a 
file that already contains no redundant information, the compression will most 
often not result in any savings to offset the additional overhead. 
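
The overhead described above is easy to observe. In this sketch, random bytes stand in for an already-compressed file such as a JPEG. That substitution is an assumption made for the demonstration (random data, like well-compressed data, has no redundancy left to remove), and picture.bin is a hypothetical name.

```shell
# Random bytes stand in for an already-compressed file (an assumption for
# illustration; random data has no redundancy for gzip to remove).
work=$(mktemp -d)
head -c 100000 /dev/urandom > "$work/picture.bin"

# Compress to a separate file so both sizes can be compared.
gzip -c "$work/picture.bin" > "$work/picture.bin.gz"

original_size=$(( $(wc -c < "$work/picture.bin") ))
compressed_size=$(( $(wc -c < "$work/picture.bin.gz") ))
echo "original:   $original_size bytes"
echo "compressed: $compressed_size bytes"
```

The "compressed" file comes out slightly larger than the input, since gzip still has to add its header and bookkeeping.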


Archiving Files 


A common file-management task often used in conjunction with compres- 
sion is archiving. Archiving is the process of gathering up many files and 
bundling them together into a single large file. Archiving is often done as 
part of system backups. It is also used when old data is moved from a system 
to some type of long-term storage. 


tar 


In the Unix-like world of software, the tar program is the classic tool for 
archiving files. Its name, short for tape archive, reveals its roots as a tool 
for making backup tapes. While it is still used for that traditional task, it 
is equally adept on other storage devices. We often see filenames that end 
with the extension .tar or .tgz, which indicate a “plain” tar archive and a 
gzipped archive, respectively. A tar archive can consist of a group of sepa- 
rate files, one or more directory hierarchies, or a mixture of both. The 
command syntax works like this: 


tar mode[options] pathname... 


Here, mode is one of the operating modes listed in Table 18-2 (only a 
partial list is shown here; see the tar man page for a complete list). 
Table 18-2: tar Modes 

Mode  Description 
c     Create an archive from a list of files and/or directories. 
x     Extract an archive. 
r     Append specified pathnames to the end of an archive. 
t     List the contents of an archive. 


tar uses a slightly odd way of expressing options, so we’ll need some 
examples to show how it works. First, let’s re-create our playground from 
the previous chapter. 


[me@linuxbox ~]$ mkdir -p playground/dir-{001..100} 
[me@linuxbox ~]$ touch playground/dir-{001..100}/file-{A..Z} 


Next, let’s create a tar archive of the entire playground. 


[me@linuxbox ~]$ tar cf playground.tar playground 


This command creates a tar archive named playground.tar that contains 
the entire playground directory hierarchy. We can see that the mode and the 
f option, which is used to specify the name of the tar archive, may be joined 
together and do not require a leading dash. Note, however, that the mode 
must always be specified first, before any other option. 

To list the contents of the archive, we can do this: 


[me@linuxbox ~]$ tar tf playground.tar 


For a more detailed listing, we can add the v (verbose) option. 


[me@linuxbox ~]$ tar tvf playground.tar 


Now, let’s extract the playground in a new location. We will do this by 
creating a new directory named foo, changing the directory, and extracting 
the tar archive. 


[me@linuxbox ~]$ mkdir foo 
[me@linuxbox ~]$ cd foo 
[me@linuxbox foo]$ tar xf ../playground.tar 
[me@linuxbox foo]$ ls 
playground 


If we examine the contents of ~/foo/playground, we see that the archive 
was successfully extracted, creating a precise reproduction of the original files. 
There is one caveat, however. Unless we are operating as the superuser, files 
and directories extracted from archives take on the ownership of the user 
performing the restoration, rather than the original owner. 


Another interesting behavior of tar is the way it handles pathnames in 
archives. The default for pathnames is relative, rather than absolute. tar 
does this by simply removing any leading slash from the pathname when 
creating the archive. To demonstrate, we will re-create our archive, this 
time specifying an absolute pathname. 


[me@linuxbox foo]$ cd 
[me@linuxbox ~]$ tar cf playground2.tar ~/playground 


Remember, ~/playground will expand into /home/me/playground when 
we press enter, so we will get an absolute pathname for our demonstration. 
Next, we will extract the archive as before and watch what happens. 


[me@linuxbox ~]$ cd foo 
[me@linuxbox foo]$ tar xf ../playground2.tar 
[me@linuxbox foo]$ ls 
home playground 
[me@linuxbox foo]$ ls home 
me 
[me@linuxbox foo]$ ls home/me 
playground 


Here we can see that when we extracted our second archive, it re-created 
the directory home/me/playground relative to our current working directory, 
~/foo, not relative to the root directory, as would have been the case with an 
absolute pathname. This might seem like an odd way for it to work, but it’s 
actually more useful this way because it allows us to extract archives to any 
location rather than being forced to extract them to their original locations. 
Repeating the exercise with the inclusion of the verbose option (v) will give 
a clearer picture of what’s going on. 
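
The leading-slash behavior can also be confirmed by listing an archive's member names directly. The sketch below assumes GNU tar and uses a scratch directory; abs.tar and out are hypothetical names.

```shell
# Scratch directory, referred to by its absolute pathname.
work=$(mktemp -d)
mkdir -p "$work/playground" "$work/out"
touch "$work/playground/file-A"

# Archive using the absolute pathname. tar reports on standard error that it
# is removing the leading slash, so that message is silenced here.
tar cf "$work/abs.tar" "$work/playground" 2>/dev/null

# List the stored member names: none should begin with a slash.
members=$(tar tf "$work/abs.tar")
echo "$members"

# Extract verbosely somewhere else; the tree is rebuilt below the current
# working directory rather than at its original location.
(cd "$work/out" && tar xvf ../abs.tar)
```

The names stored in the archive start with the first directory component rather than with /, which is why extraction lands under the current directory.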

Let’s consider a hypothetical, yet practical, example of tar in action. 
Imagine we want to copy the home directory and its contents from one sys- 
tem to another and we have a large USB hard drive that we can use for the 
transfer. On our modern Linux system, the drive is “automagically” mounted 
in the /media directory. Let’s also imagine that the disk has a volume name of 
BigDisk when we attach it. To make the tar archive, we can do the following: 


[me@linuxbox ~]$ sudo tar cf /media/BigDisk/home.tar /home 


After the tar file is written, we unmount the drive and attach it to the 
second computer. Again, it is mounted at /media/BigDisk. To extract the 
archive, we do this: 


[me@linuxbox2 ~]$ cd / 
[me@linuxbox2 /]$ sudo tar xf /media/BigDisk/home.tar 


What’s important to see here is that we must first change directory to / 
so that the extraction is relative to the root directory since all pathnames 
within the archive are relative. 
When extracting an archive, it’s possible to limit what is extracted from 
the archive. For example, if we wanted to extract a single file from an archive, 
it could be done like this: 


tar xf archive.tar pathname 


By adding the trailing pathname to the command, tar will restore only 
the specified file. Multiple pathnames may be specified. Note that the path- 
name must be the full, exact relative pathname as stored in the archive. 
When specifying pathnames, wildcards are not normally supported; how- 
ever, the GNU version of tar (which is the version most often found in 
Linux distributions) supports them with the --wildcards option. Here is an 
example using our previous playground2.tar file: 


[me@linuxbox ~]$ cd foo 
[me@linuxbox foo]$ tar xf ../playground2.tar --wildcards 'home/me/playground/dir-*/file-A' 


This command will extract only files matching the specified pathname 
including the wildcard dir-*. 

tar is often used in conjunction with find to produce archives. In this 
example, we will use find to produce a set of files to include in an archive. 


[me@linuxbox ~]$ find playground -name 'file-A' -exec tar rf playground.tar '{}' '+' 


Here we use find to match all the files in playground named file-A and 
then, using the -exec action, we invoke tar in the append mode (r) to add 
the matching files to the archive playground.tar. 

Using tar with find is a good way of creating incremental backups of a 
directory tree or an entire system. By using find to match files newer than 
a timestamp file, we could create an archive that contains only those files 
newer than the last archive, assuming that the timestamp file is updated 
right after each archive is created. 
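
The incremental backup idea just described might look something like the following. This is a minimal sketch, not a complete backup scheme: it assumes GNU tar (for --null and -T), and the names timestamp, incremental.tar, old-1, old-2, and new-1 are invented for the demonstration. The file dates are set explicitly with touch -t so that the outcome does not depend on timing.

```shell
# Scratch tree; every name here is illustrative.
work=$(mktemp -d)
mkdir -p "$work/playground"

# Two files, plus a timestamp file that is newer than both
# (as it would be right after a full archive had been made).
touch -t 202001010000 "$work/playground/old-1" "$work/playground/old-2"
touch -t 202001020000 "$work/timestamp"

# Later, one file changes and a new one appears.
touch "$work/playground/old-2" "$work/playground/new-1"

# Archive only the files modified after the timestamp, then refresh the
# timestamp so the next run starts from this point.
(cd "$work" &&
  find playground -type f -newer timestamp -print0 |
  tar cf incremental.tar --null -T - &&
  touch timestamp)

tar tf "$work/incremental.tar"
```

Only the changed and the newly created file end up in the archive. The -print0 and --null pair keeps filenames containing spaces or newlines intact on their way through the pipe.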

tar can also make use of both standard input and output. Here is a 
comprehensive example: 


[me@linuxbox foo]$ cd 
[me@linuxbox ~]$ find playground -name 'file-A' | tar cf - --files-from=- | gzip > playground.tgz 
In this example, we used the find program to produce a list of matching 
files and piped them into tar. If the filename - is specified, it is taken 
to mean standard input or output, as needed. (By the way, this convention 
of using - to represent standard input/output is used by a number of other 
programs, too). The --files-from option (which may also be specified as -T) 
causes tar to read its list of pathnames from a file rather than the command 
line. Lastly, the archive produced by tar is piped into gzip to create the 
compressed archive playground.tgz. The .tgz extension is the conventional 
extension given to gzip-compressed tar files. The extension .tar.gz is also 
used sometimes. 


While we used the gzip program externally to produce our compressed 
archive, modern versions of GNU tar support both gzip and bzip2 compres- 
sion directly with the use of the z and j options, respectively. Using our pre- 
vious example as a base, we can simplify it this way: 


[me@linuxbox ~]$ find playground -name 'file-A' | tar czf playground.tgz -T - 


If we had wanted to create a bzip2-compressed archive instead, we could 
have done this: 


[me@linuxbox ~]$ find playground -name 'file-A' | tar cjf playground.tbz -T - 


By simply changing the compression option from z to j (and changing 
the output file’s extension to .tbz to indicate a bzip2-compressed file), we 
enabled bzip2 compression. 
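
One way to convince ourselves that the z option produces the same kind of file as the external gzip pipeline is to inspect the result with both tools. The sketch below assumes GNU tar and uses a one-file stand-in for playground in a scratch directory.

```shell
# Tiny stand-in for playground (illustrative names).
work=$(mktemp -d)
mkdir -p "$work/playground/dir-001"
touch "$work/playground/dir-001/file-A"

# Let tar run gzip itself (z) while creating the archive.
(cd "$work" && find playground -name 'file-A' | tar czf playground.tgz -T -)

# The result is an ordinary gzip stream, so gzip can test it...
gzip -t "$work/playground.tgz" && echo "gzip stream OK"
# ...and tar can list it by adding z to the t mode.
tar tzf "$work/playground.tgz"
```

The same z (or j) letter is added when listing or extracting, so tar tzf and tar xzf are the counterparts of tar czf.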

Another interesting use of standard input and output with the tar com- 
mand involves transferring files between systems over a network. Imagine 
that we had two machines running a Unix-like system equipped with tar and 
ssh. In such a scenario, we could transfer a directory from a remote system 
(named remote-sys for this example) to our local system. 


[me@linuxbox ~]$ mkdir remote-stuff 
[me@linuxbox ~]$ cd remote-stuff 
[me@linuxbox remote-stuff]$ ssh remote-sys 'tar cf - Documents' | tar xf - 
me@remote-sys's password: 
[me@linuxbox remote-stuff]$ ls 
Documents 


Here we were able to copy a directory named Documents from the remote 
system remote-sys to a directory within the directory named remote-stuff on the 
local system. How did we do this? First, we launched the tar program on the 
remote system using ssh. You will recall that ssh allows us to execute a pro- 
gram remotely on a networked computer and “see” the results on the local 
system—the standard output produced on the remote system is sent to the 
local system for viewing. We can take advantage of this by having tar create 
an archive (the c mode) and send it to standard output, rather than a file 
(the f option with the dash argument), thereby transporting the archive over 
the encrypted tunnel provided by ssh to the local system. On the local system, 
we execute tar and have it expand an archive (the x mode) supplied from 
standard input (again, the f option with the dash argument). 
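
The same pipe can be tried without a second machine. In this sketch, two scratch directories play the roles of the remote and local systems; the directory names and note.txt are assumptions for illustration. Swapping the first subshell for an ssh command gives the network version described above.

```shell
# Scratch directories standing in for the two systems (illustrative names).
work=$(mktemp -d)
mkdir -p "$work/remote/Documents" "$work/local/remote-stuff"
echo "hello" > "$work/remote/Documents/note.txt"

# Producer: create an archive on standard output (f with -).
# Consumer: extract an archive read from standard input (f with -).
(cd "$work/remote" && tar cf - Documents) |
  (cd "$work/local/remote-stuff" && tar xf -)

ls "$work/local/remote-stuff/Documents"
```

No intermediate archive file is ever written; the data flows straight from one tar to the other.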


zip 
The zip program is both a compression tool and an archiver. The file 
format used by the program is familiar to Windows users, as it reads and 
writes .zip files. In Linux, however, gzip is the predominant compression 
program, with bzip2 being a close second. 

In its most basic usage, zip is invoked like this: 


zip options zipfile file... 
For example, to make a zip archive of our playground, we would do this: 


[me@linuxbox ~]$ zip -r playground.zip playground 


Unless we include the -r option for recursion, only the playground 
directory (but none of its contents) is stored. Although the addition of the 
extension .zip is automatic, we will include the file extension for clarity. 

During the creation of the zip archive, zip will normally display a series 
of messages like this: 


adding: playground/dir-020/file-Z (stored 0%) 
adding: playground/dir-020/file-Y (stored 0%) 
adding: playground/dir-020/file-X (stored 0%) 
adding: playground/dir-087/ (stored 0%) 
adding: playground/dir-087/file-S (stored 0%) 


These messages show the status of each file added to the archive. zip will 
add files to the archive using one of two storage methods: either it will “store” 
a file without compression, as shown here, or it will “deflate” the file, which 
performs compression. The numeric value displayed after the storage method 
indicates the amount of compression achieved. Since our playground 
contains only empty files, no compression is performed on its contents. 

Extracting the contents of a zip file is straightforward when using the 
unzip program. 


[me@linuxbox ~]$ cd foo 
[me@linuxbox foo]$ unzip ../playground.zip 


One thing to note about zip (as opposed to tar) is that if an existing 
archive is specified, it is updated rather than replaced. This means the 
existing archive is preserved, but new files are added, and matching files 
are replaced. 

We can list and extract files selectively from a zip archive by specifying 
them to unzip. 


[me@linuxbox ~]$ unzip -l playground.zip playground/dir-087/file-Z 
Archive: playground.zip 
  Length     Date   Time   Name 
       0 10-05-18  09:25   playground/dir-087/file-Z 
[me@linuxbox ~]$ cd foo 
[me@linuxbox foo]$ unzip ../playground.zip playground/dir-087/file-Z 
Archive: ../playground.zip 
replace playground/dir-087/file-Z? [y]es, [n]o, [A]ll, [N]one, [r]ename: y 
extracting: playground/dir-087/file-Z 


Using the -l option causes unzip to merely list the contents of the archive 
without extracting the file. If no files are specified, unzip will list all files in 
the archive. The -v option can be added to increase the verbosity of the listing. 
Note that when the archive extraction conflicts with an existing file, the 
user is prompted before the file is replaced. 

Like tar, zip can make use of standard input and output, though its 
implementation is somewhat less useful. It is possible to pipe a list of file- 
names to zip via the -@ option. 


[me@linuxbox foo]$ cd 
[me@linuxbox ~]$ find playground -name "file-A" | zip -@ file-A.zip 


Here we use find to generate a list of files matching the test -name "file-A" 
and then pipe the list into zip, which creates the archive file-A.zip containing 
the selected files. 

zip also supports writing its output to standard output, but its use is lim- 
ited because few programs can make use of the output. Unfortunately, the 
unzip program does not accept standard input. This prevents zip and unzip 
from being used together to perform network file copying like tar. 

zip can, however, accept standard input, so it can be used to compress 
the output of other programs. 


[me@linuxbox ~]$ ls -l /etc/ | zip ls-etc.zip - 
adding: - (deflated 80%) 


In this example, we pipe the output of ls into zip. Like tar, zip interprets 
the trailing dash as “use standard input for the input file.” 

The unzip program allows its output to be sent to standard output when 
the -p (for pipe) option is specified. 


[me@linuxbox ~]$ unzip -p ls-etc.zip | less 


We touched on some of the basic things that zip/unzip can do. They 
both have a lot of options that add to their flexibility, though some are 
platform specific to other systems. The man pages for both zip and unzip 
are pretty good and contain useful examples; however, the main use of 
these programs is for exchanging files with Windows systems, rather than 
performing compression and archiving on Linux, where tar and gzip are 
greatly preferred. 


Synchronizing Files and Directories 


A common strategy for maintaining a backup copy of a system involves 
keeping one or more directories synchronized with another directory (or 
directories) located on either the local system (usually a removable storage 
device of some kind) or a remote system. We might, for example, have a 
local copy of a website under development and synchronize it from time to 
time with the “live” copy on a remote web server. 

In the Unix-like world, the preferred tool for this task is rsync. This 
program can synchronize both local and remote directories by using the 
rsync remote-update protocol, which allows rsync to quickly detect the 
differences between two directories and perform the minimum amount of 
copying required to bring them into sync. This makes rsync very fast and 
economical to use, compared to other kinds of copy programs. 

rsync is invoked like this: 


rsync options source destination 


where source and destination are one of the following: 


• A local file or directory 
• A remote file or directory in the form of [user@]host:path 
• A remote rsync server specified with a URI of rsync://[user@]host[:port]/path 


Note that either the source or the destination must be a local file. 
Remote-to-remote copying is not supported. 

Let’s try rsync on some local files. First, let’s clean out our foo directory. 


[me@linuxbox ~]$ rm -rf foo/* 


Next, we’ll synchronize the playground directory with a corresponding 
copy in foo. 


[me@linuxbox ~]$ rsync -av playground foo 


We’ve included both the -a option (for archiving—causes recursion 
and preservation of file attributes) and the -v option (verbose output) to 
make a mirror of the playground directory within foo. While the command 
runs, we will see a list of the files and directories being copied. At the end, 
we will see a summary message like this indicating the amount of copying 
performed: 


sent 135759 bytes received 57870 bytes 387258.00 bytes/sec 
total size is 3230 speedup is 0.02 


If we run the command again, we will see a different result. 


[me@linuxbox ~]$ rsync -av playground foo 
building file list ... done 


sent 22635 bytes received 20 bytes 45310.00 bytes/sec 
total size is 3230 speedup is 0.14 


Notice that there was no listing of files. This is because rsync detected 
that there were no differences between ~/playground and ~/foo/playground, 
and therefore it didn’t need to copy anything. If we modify a file in 
playground and run rsync again: 


[me@linuxbox ~]$ touch playground/dir-099/file-Z 
[me@linuxbox ~]$ rsync -av playground foo 

building file list ... done 

playground/dir-099/file-Z 

sent 22685 bytes received 42 bytes 45454.00 bytes/sec 
total size is 3230 speedup is 0.14 


we see that rsync detected the change and copied only the updated file. 
There is a subtle but useful feature we can use when we specify an rsync 
source. Let’s consider two directories. 


[me@linuxbox ~]$ ls 
source destination 


Directory source contains one file named file1, and directory destination 
is empty. If we perform a copy of source to destination like so: 


[me@linuxbox ~]$ rsync -a source destination 


then rsync copies the directory source into destination. 


[me@linuxbox ~]$ ls destination 
source 


However, if we append a trailing / to the source directory name, rsync 
will copy only the contents of the source directory and not the directory itself. 


[me@linuxbox ~]$ rsync -a source/ destination 
[me@linuxbox ~]$ ls destination 
file1 


This is handy if we want only the contents of a directory copied without 
creating another level of directories within the destination. We can think of 
it as being like source/* in its outcome, but this method will copy all of the 
source directory’s content including the hidden files. 

As a practical example, let’s consider the imaginary external hard drive 
that we used earlier with tar. If we attach the drive to our system and, once 
again, it is mounted at /media/BigDisk, we can perform a useful system 
backup by first creating a directory named /backup on the external drive 
and then using rsync to copy the most important stuff from our system to 
the external drive. 


[me@linuxbox ~]$ mkdir /media/BigDisk/backup 
[me@linuxbox ~]$ sudo rsync -av --delete /etc /home /usr/local /media/BigDisk/backup 
In this example, we copied the /etc, /home, and /usr/local directories 
from our system to our imaginary storage device. We included the --delete 
option to remove files that may have existed on the backup device that no 
longer existed on the source device (this is irrelevant the first time we make 
a backup but will be useful on subsequent copies). Repeating the procedure 
of attaching the external drive and running this rsync command would be 
a useful (though not ideal) way of keeping a small system backed up. Of 
course, an alias would be helpful here, too. We could create an alias and 
add it to our .bashrc file to provide this feature. 


alias backup='sudo rsync -av --delete /etc /home /usr/local /media/BigDisk/backup' 


Now all we have to do is attach our external drive and run the backup 
command to do the job. 


Using rsync over a Network 


One of the real beauties of rsync is that it can be used to copy files over a 
network. After all, the r in rsync stands for “remote.” Remote copying can 
be done in one of two ways. The first way is with another system that has 
rsync installed, along with a remote shell program such as ssh. Let’s say we 
had another system on our local network with a lot of available hard drive 
space and we wanted to perform our backup operation using the remote 
system instead of an external drive. Assuming that it already had a directory 
named /backup where we could deliver our files, we could do this: 


[me@linuxbox ~]$ sudo rsync -av --delete --rsh=ssh /etc /home /usr/local remote-sys:/backup 
We made two changes to our command to facilitate the network 
copy. First, we added the --rsh=ssh option, which instructs rsync to use 
the ssh program as its remote shell. In this way, we were able to use an 
ssh-encrypted tunnel to securely transfer the data from the local system 
to the remote host. Second, we specified the remote host by prefixing its 
name (in this case the remote host is named remote-sys) to the destination 
pathname. 

The second way that rsync can be used to synchronize files over a net- 
work is by using an rsync server. rsync can be configured to run as a daemon 
and listen to incoming requests for synchronization. This is often done 
to allow mirroring of a remote system. For example, Red Hat Software 
maintains a large repository of software packages under development for 
its Fedora distribution. It is useful for software testers to mirror this collec- 
tion during the testing phase of the distribution release cycle. Since files in 
the repository change frequently (often more than once a day), it is desirable 
to maintain a local mirror by periodic synchronization, rather than 
by bulk copying of the repository. One of these repositories is kept at Duke 
University; we could mirror it using our local copy of rsync and their rsync 
server like this: 


[me@linuxbox ~]$ mkdir fedora-devel 
[me@linuxbox ~]$ rsync -av --delete rsync://archive.linux.duke.edu/fedora/linux/development/rawhide/Everything/x86_64/os/ fedora-devel 


In this example, we use the URI of the remote rsync server, which consists 
of a protocol (rsync://), followed by the remote hostname 
(archive.linux.duke.edu), followed by the pathname of the repository. 


Summing Up 


We’ve looked at the common compression and archiving programs used 
on Linux and other Unix-like operating systems. For archiving files, the 
tar/gzip combination is the preferred method on Unix-like systems, while 
zip/unzip is used for interoperability with Windows systems. Finally, we 
looked at the rsync program (a personal favorite), which is very handy for 
efficient synchronization of files and directories across systems.


REGULAR EXPRESSIONS


In the next few chapters, we are going to look at tools used to manipulate
text. As we have seen, text data plays an important role on all Unix-like
systems, such as Linux. But before we can fully appreciate all the features
offered by these tools, first we have to examine a technology that is
frequently associated with the most sophisticated uses of these tools:
regular expressions.


As we have navigated the many features and facilities offered by the 
command line, we have encountered some truly arcane features, such as 
shell expansion and quoting, keyboard shortcuts, and command history, 
not to mention the vi editor. Regular expressions continue this “tradition” 
and may be (arguably) the most arcane feature of them all. This is not to 
suggest that the time it takes to learn about them is not worth the effort. 
Quite the contrary. A good understanding will enable us to perform amaz- 
ing feats, though their full value may not be immediately apparent. 


What Are Regular Expressions? 


Simply put, regular expressions are symbolic notations used to identify 
patterns in text. In some ways, they resemble the shell’s wildcard method 
of matching file and pathnames but on a much grander scale. Regular 
expressions are supported by many command line tools and by most 
programming languages to facilitate the solution of text manipulation 
problems. However, to further confuse things, not all regular expressions 
are the same; they vary slightly from tool to tool and from programming 
language to language. For our discussion, we will limit ourselves to regular 
expressions as described in the POSIX standard (which will cover most of 
the command line tools), as opposed to many programming languages 
(most notably Perl), which use slightly larger and richer sets of notations. 


grep


The main program we will use to work with regular expressions is our old
pal grep. The name grep is actually derived from the phrase “global regular
expression print,” so we can see that grep has something to do with regular
expressions. In essence, grep searches text files for text matching a specified
regular expression and outputs any line containing a match to standard
output.

So far, we have used grep with fixed strings, like so:


[me@linuxbox ~]$ ls /usr/bin | grep zip


This will list all the files in the /usr/bin directory whose names contain 
the substring zip. 

The grep program accepts options and arguments this way, where regex 
is a regular expression: 


grep [options] regex [file...]


Table 19-1 describes the commonly used grep options.


Table 19-1: grep Options

Option  Long option            Description
-i      --ignore-case          Ignore case. Do not distinguish between uppercase
                               and lowercase characters.
-v      --invert-match         Invert match. Normally, grep prints lines that
                               contain a match. This option causes grep to print
                               every line that does not contain a match.
-c      --count                Print the number of matches (or non-matches if
                               the -v option is also specified) instead of the
                               lines themselves.
-l      --files-with-matches   Print the name of each file that contains a match
                               instead of the lines themselves.
-L      --files-without-match  Like the -l option, but print only the names of
                               files that do not contain matches.
-n      --line-number          Prefix each matching line with the number of the
                               line within the file.
-h      --no-filename          For multifile searches, suppress the output of
                               filenames.


To more fully explore grep, let’s create some text files to search. 


[me@linuxbox ~]$ ls /bin > dirlist-bin.txt
[me@linuxbox ~]$ ls /usr/bin > dirlist-usr-bin.txt
[me@linuxbox ~]$ ls /sbin > dirlist-sbin.txt
[me@linuxbox ~]$ ls /usr/sbin > dirlist-usr-sbin.txt
[me@linuxbox ~]$ ls dirlist*.txt
dirlist-bin.txt  dirlist-sbin.txt  dirlist-usr-bin.txt  dirlist-usr-sbin.txt


We can perform a simple search of our list of files like this: 


[me@linuxbox ~]$ grep bzip dirlist*.txt 
dirlist-bin.txt:bzip2 
dirlist-bin.txt:bzip2recover 


In this example, grep searches all the listed files for the string bzip and
finds two matches, both in the file dirlist-bin.txt. If we were interested only in
the list of files that contained matches rather than the matches themselves,
we could specify the -l option.


[me@linuxbox ~]$ grep -l bzip dirlist*.txt
dirlist-bin.txt 


Conversely, if we wanted to see only a list of the files that did not con- 
tain a match, we could do this: 


[me@linuxbox ~]$ grep -L bzip dirlist*.txt 
dirlist-sbin.txt
dirlist-usr-bin.txt
dirlist-usr-sbin.txt
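
The options in Table 19-1 can also be tried on a small scale. The following is a minimal sketch, assuming a scratch directory and two made-up files (a.txt and b.txt, which are not part of our dirlist examples); it exercises -i, -c, -v, -n, -l, and -L in turn.

```shell
# Scratch files with made-up contents (assumptions for illustration only).
tmp=$(mktemp -d)
printf '%s\n' bzip2 GZIP zip tar > "$tmp/a.txt"
printf '%s\n' ls cat > "$tmp/b.txt"

# -i ignores case and -c prints a count: bzip2, GZIP, and zip all match.
count_i=$(grep -ic 'ZIP' "$tmp/a.txt")

# -v inverts the match: print the lines that do NOT contain lowercase "zip".
inverted=$(grep -v 'zip' "$tmp/a.txt")

# -n prefixes each matching line with its line number within the file.
numbered=$(grep -n 'tar' "$tmp/a.txt")

# -l and -L print filenames instead of lines (run inside the scratch directory).
with_match=$(cd "$tmp" && grep -l 'zip' a.txt b.txt)
without_match=$(cd "$tmp" && grep -L 'zip' a.txt b.txt)

printf '%s\n' "$count_i" "$inverted" "$numbered" "$with_match" "$without_match"
rm -rf "$tmp"
```

Because the search without -i is case sensitive, GZIP appears in the inverted output alongside tar.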


Metacharacters and Literals 


While it might not seem apparent, our grep searches have been using regular
expressions all along, albeit very simple ones. The regular expression bzip
is taken to mean that a match will occur only if the line in the file contains
at least four characters and that somewhere in the line the characters b,
z, i, and p are found in that order, with no other characters in between.
The characters in the string bzip are all literal characters, in that they match
themselves. In addition to literals, regular expressions may also include
metacharacters that are used to specify more complex matches. Regular
expression metacharacters consist of the following:


^ $ . [ ] { } - ? * + ( ) | \


All other characters are considered literals, though the backslash char- 
acter is used in a few cases to create metasequences, as well as allowing the 
metacharacters to be escaped and treated as literals instead of being inter- 
preted as metacharacters. 


As we can see, many of the regular expression metacharacters are also characters 
that have meaning to the shell when expansion is performed. When we pass regular 
expressions containing metacharacters on the command line, it is vital that they be 
enclosed in quotes to prevent the shell from attempting to expand them. 


The Any Character 


The first metacharacter we will look at is the dot (.) or period character, 
which is used to match any character. If we include it in a regular expression, 
it will match any character in that character position. Here’s an example: 


[me@linuxbox ~]$ grep -h '.zip' dirlist*.txt 
bunzip2 
bzip2 
bzip2recover 
gunzip 

gzip 

funzip 
gpg-zip 
preunzip 
prezip 
prezip-bin 
unzip 
unzipsfx 


We searched for any line in our files that matches the regular expression 
.zip. There are a couple of interesting things to note about the results. Notice 
that the zip program was not found. This is because the inclusion of the dot 
metacharacter in our regular expression increased the length of the required 
match to four characters, and because the name zip contains only three, it
does not match. Also, if any files in our lists had contained the file extension 
.zip, they would have been matched as well, because the period character in 
the file extension would be matched by the “any character,” too. 
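
To see both effects without depending on what happens to be installed, here is a small sketch that uses a made-up list of names (an assumption standing in for our dirlist files). It also shows how escaping the dot with a backslash restricts the match to a literal period.

```shell
# Hypothetical stand-in for the dirlist files: a few names, one per line.
names=$(printf '%s\n' zip gzip unzip archive.zip zipinfo)

# '.zip' requires some character before "zip", so the bare names
# "zip" and "zipinfo" do not match, while "archive.zip" does.
dot_matches=$(printf '%s\n' "$names" | grep '.zip')

# '\.zip' escapes the dot, so only a literal period matches.
literal_dot=$(printf '%s\n' "$names" | grep '\.zip')

printf '%s\n' "$dot_matches" "---" "$literal_dot"
```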


Anchors 


The caret (^) and dollar sign ($) are treated as anchors in regular expressions.
This means they cause the match to occur only if the regular expression is
found at the beginning of the line (^) or at the end of the line ($).


[me@linuxbox ~]$ grep -h '^zip' dirlist*.txt
zip
zipcloak
zipgrep
zipinfo
zipnote
zipsplit
[me@linuxbox ~]$ grep -h 'zip$' dirlist*.txt
gunzip
gzip
funzip
gpg-zip
preunzip
prezip
unzip
zip
[me@linuxbox ~]$ grep -h '^zip$' dirlist*.txt
zip


Here we searched the list of files for the string zip located at the
beginning of the line, at the end of the line, and on a line where it is at both
the beginning and the end of the line (i.e., by itself on the line). Note that the
regular expression ^$ (a beginning and an end with nothing in between)
will match blank lines.
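
The blank-line case is worth a quick experiment. This sketch assumes a temporary file with made-up contents that include two empty lines; it counts the blank lines, strips them, and confirms that anchoring at both ends matches only a line consisting of zip alone.

```shell
# A made-up five-line file: zip, (blank), unzip, zipinfo, (blank).
tmp=$(mktemp)
printf 'zip\n\nunzip\nzipinfo\n\n' > "$tmp"

# '^$' matches lines with nothing between the beginning and the end.
blank_count=$(grep -c '^$' "$tmp")

# Inverting the same expression removes the blank lines.
nonblank=$(grep -v '^$' "$tmp")

# Anchoring at both ends matches only the line that is exactly "zip".
exact=$(grep -n '^zip$' "$tmp")

printf '%s\n' "$blank_count" "$nonblank" "$exact"
rm -f "$tmp"
```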


A CROSSWORD PUZZLE HELPER 


Even with our limited knowledge of regular expressions at this point, we can do 
something useful. 

My wife loves crossword puzzles, and she will sometimes ask me for help 
with a particular question. Something like, “What's a five-letter word whose 
third letter is j and last letter is r that means . . . ?” This kind of question got me
thinking. 

Did you know that your Linux system contains a dictionary? It does. Take 
a look in the /usr/share/dict directory, and you might find one or several. The 
dictionary files located there are just long lists of words, one per line, arranged
in alphabetical order. On my system, the words file contains just over 98,500
words. To find possible answers to the crossword puzzle question above, we
could do this:


[me@linuxbox ~]$ grep -i '^..j.r$' /usr/share/dict/words
Major
major


Using this regular expression, we can find all the words in our dictionary
file that are five letters long and have a j in the third position and an r in the
last position.


In addition to matching any character at a given position in our regular
expression, we can also match a single character from a specified set of 
characters by using bracket expressions. With bracket expressions, we can 
specify a set of characters (including characters that would otherwise be 
interpreted as metacharacters) to be matched. In this example, using a 
two-character set, we match any line that contains the string bzip or gzip: 


[me@linuxbox ~]$ grep -h '[bg]zip' dirlist*.txt 
bzip2
bzip2recover
gzip


A set may contain any number of characters, and metacharacters lose 
their special meaning when placed within brackets. However, there are two 
cases in which metacharacters are used within bracket expressions and have 
different meanings. The first is the caret (^), which is used to indicate
negation; the second is the dash (-), which is used to indicate a character range.


Negation 


If the first character in a bracket expression is a caret (^), the remaining
characters are taken to be a set of characters that must not be present at the given
character position. We do this by modifying our previous example, as follows:


[me@linuxbox ~]$ grep -h '[^bg]zip' dirlist*.txt
bunzip2
gunzip
funzip
gpg-zip
preunzip
prezip
prezip-bin
unzip
unzipsfx


With negation activated, we get a list of files that contain the string zip
preceded by any character except b or g. Notice that the file zip was not found.
A negated character set still requires a character at the given position, but the
character must not be a member of the negated set.

The caret character invokes negation only if it is the first character 
within a bracket expression; otherwise, it loses its special meaning and 
becomes an ordinary character in the set. 
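
The difference the caret's position makes can be sketched with a made-up list of names (an assumption for illustration; one of them deliberately begins with a literal caret).

```shell
# Hypothetical names; '^zip' starts with a literal caret character.
names=$(printf '%s\n' bzip gzip xzip '^zip')

# Caret first: negation. "zip" must follow a character other than b or g.
negated=$(printf '%s\n' "$names" | grep '[^bg]zip')

# Caret not first: it is just another member of the set {b, g, ^}.
literal_caret=$(printf '%s\n' "$names" | grep '[bg^]zip')

printf '%s\n' "$negated" "---" "$literal_caret"
```

In the first search the name ^zip matches because a caret is simply a character that is neither b nor g; in the second it matches because the caret is explicitly listed in the set.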


Traditional Character Ranges 


If we wanted to construct a regular expression that would find every file in 
our lists beginning with an uppercase letter, we could do this: 


[me@linuxbox ~]$ grep -h '^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' dirlist*.txt


It’s just a matter of putting all 26 uppercase letters in a bracket expres- 
sion. But the idea of all that typing is deeply troubling, so here is another way. 


[me@linuxbox ~]$ grep -h '^[A-Z]' dirlist*.txt
MAKEDEV
ControlPanel
GET
HEAD
POST
X
X11
Xorg
MAKEFLOPPIES
NetworkManager
NetworkManagerDispatcher


By using a three-character range, we can abbreviate the 26 letters. Any 
range of characters can be expressed this way including multiple ranges, 
such as this expression that matches all filenames starting with letters and 
numbers: 


[me@linuxbox ~]$ grep -h '^[A-Za-z0-9]' dirlist*.txt


In character ranges, we see that the dash character is treated specially, so 
how do we actually include a dash character in a bracket expression? By mak- 
ing it the first character in the expression. Consider these two examples: 


[me@linuxbox ~]$ grep -h '[A-Z]' dirlist*.txt 


This will match every filename containing an uppercase letter. The fol- 
lowing will match every filename containing a dash or an uppercase A or an 
uppercase Z: 


[me@linuxbox ~]$ grep -h '[-AZ]' dirlist*.txt 
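
The two expressions can be compared side by side on a made-up list of names. This sketch sets LC_ALL=C for the grep commands, which is an assumption added here so that the range [A-Z] means exactly the 26 uppercase ASCII letters regardless of the system's locale.

```shell
# Hypothetical names, one per line.
names=$(printf '%s\n' gpg-zip Alpha Zeta Mike lower)

# A range: any name containing an uppercase letter A through Z.
range=$(printf '%s\n' "$names" | LC_ALL=C grep '[A-Z]')

# Dash first: a three-member set containing -, A, and Z.
dash_set=$(printf '%s\n' "$names" | LC_ALL=C grep '[-AZ]')

printf '%s\n' "$range" "---" "$dash_set"
```

Mike matches only the range, since M lies between A and Z, while gpg-zip matches only the set, since it contains a dash but no uppercase letter.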


POSIX Character Classes 


The traditional character ranges are an easily understood and effective way 
to handle the problem of quickly specifying sets of characters. Unfortunately, 
they don’t always work. While we have not encountered any problems with 
our use of grep so far, we might run into problems using other programs. 

In Chapter 4, we looked at how wildcards are used to perform path- 
name expansion. In that discussion, we said that character ranges could 
be used in a manner almost identical to the way they are used in regular 
expressions, but here’s the problem: 


[me@linuxbox ~]$ ls /usr/sbin/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*
/usr/sbin/ModemManager
/usr/sbin/NetworkManager


(Depending on the Linux distribution, we will get a different list of files,
possibly an empty list. This example is from Ubuntu.) This command
produces the expected result: a list of only the files whose names begin with an
uppercase letter. With the following command, however, we get an entirely
different result (only a partial listing of the results is shown):


[me@linuxbox ~]$ ls /usr/sbin/[A-Z]*
/usr/sbin/biosdecode
/usr/sbin/chat
/usr/sbin/chgpasswd
/usr/sbin/chpasswd
/usr/sbin/chroot
/usr/sbin/cleanup-info
/usr/sbin/complain
/usr/sbin/console-kit-daemon


Why is that? It’s a long story, but here’s the short version: 

Back when Unix was first developed, it knew only about ASCII char- 
acters, and this feature reflects that fact. In ASCII, the first 32 characters 
(numbers 0-31) are control codes (things such as tabs, backspaces, and 
carriage returns). The next 32 (32-63) contain printable characters, 
including most punctuation characters and the numerals 0-9. The next
32 (numbers 64-95) contain the uppercase letters and a few more punctuation
symbols. The final 32 (numbers 96-127) contain the lowercase letters
and yet more punctuation symbols. Based on this arrangement, systems
using ASCII used a collation order that looks like this:


ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


This differs from proper dictionary order, which is like this:


aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ


As the popularity of Unix spread beyond the United States, there grew 
a need to support characters not found in US English. The ASCII table was 
expanded to use a full eight bits, adding characters 128-255, which accom- 
modated many more languages. To support this capability, the POSIX stan- 
dards introduced a concept called a locale, which could be adjusted to select 
the character set needed for a particular location. We can see the language 
setting of our system using the following command. 


[me@linuxbox ~]$ echo $LANG 
en_US.UTF-8 


With this setting, POSIX-compliant applications will use a dictionary
collation order rather than ASCII order. This explains the behavior of the 
previous commands. A character range of [A-Z] when interpreted in dic- 
tionary order includes all of the alphabetic characters except the lowercase 
a, hence our results. 
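
The traditional ASCII order is easy to observe by forcing it for a single command. The sketch below sets LC_ALL=C on individual commands (an assumption introduced here; it overrides LANG for that command only). How the same commands behave under a dictionary-order locale varies between systems and tool versions, so only the ASCII results are shown.

```shell
# Under the C locale, sort uses ASCII order: all uppercase before lowercase.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')

# Under the C locale, [A-Z] covers only the 26 uppercase ASCII letters.
c_range=$(printf '%s\n' apple Banana cherry | LC_ALL=C grep '^[A-Z]')

printf '%s\n' "$c_order" "$c_range"
```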


To partially work around this problem, the POSIX standard includes 
a number of character classes that provide useful ranges of characters, as 
described in Table 19-2. 


Table 19-2: POSIX Character Classes

Character class  Description
[:alnum:]        The alphanumeric characters. In ASCII, equivalent to:
                 [A-Za-z0-9]
[:word:]         The same as [:alnum:], with the addition of the underscore (_)
                 character.
[:alpha:]        The alphabetic characters. In ASCII, equivalent to: [A-Za-z]
[:blank:]        Includes the space and tab characters.
[:cntrl:]        The ASCII control codes. Includes the ASCII characters 0
                 through 31 and 127.
[:digit:]        The numerals 0 through 9.
[:graph:]        The visible characters. In ASCII, it includes characters 33
                 through 126.
[:lower:]        The lowercase letters.
[:punct:]        The punctuation characters. In ASCII, equivalent to:
                 [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]
[:print:]        The printable characters. All the characters in [:graph:] plus
                 the space character.
[:space:]        The whitespace characters including space, tab, carriage
                 return, newline, vertical tab, and form feed. In ASCII,
                 equivalent to: [ \t\r\n\v\f]
[:upper:]        The uppercase characters.
[:xdigit:]       Characters used to express hexadecimal numbers. In ASCII,
                 equivalent to: [0-9A-Fa-f]


Even with the character classes, there is still no convenient way to
express partial ranges, such as [A-M].

Using character classes, we can repeat our directory listing and see an
improved result.


[me@linuxbox ~]$ ls /usr/sbin/[[:upper:]]*
/usr/sbin/MAKEFLOPPIES 
/usr/sbin/NetworkManagerDispatcher 
/usr/sbin/NetworkManager 


Remember, however, that this is not an example of a regular expression; 
rather, it is the shell performing pathname expansion. We show it here 
because POSIX character classes can be used for both. 
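
For comparison, here is how the same classes look inside a regular expression. This is a sketch using a made-up list of names (an assumption, not taken from our dirlist files). Note that a class name such as [:upper:] must itself appear inside a bracket expression, which is why the brackets are doubled.

```shell
# Hypothetical names, one per line.
names=$(printf '%s\n' MAKEDEV chroot X11 _private 9base)

# Names that begin with an uppercase letter.
upper_start=$(printf '%s\n' "$names" | grep '^[[:upper:]]')

# Names that begin with a numeral.
digit_start=$(printf '%s\n' "$names" | grep '^[[:digit:]]')

# A class can be negated like any other set member:
# names that begin with something other than a letter or numeral.
other_start=$(printf '%s\n' "$names" | grep '^[^[:alnum:]]')

printf '%s\n' "$upper_start" "$digit_start" "$other_start"
```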
REVERTING TO TRADITIONAL COLLATION ORDER


You can opt to have your system use the traditional (ASCII) collation order by 
changing the value of the LANG environment variable. As we saw earlier, the 
LANG variable contains the name of the language and character set used in 
your locale. This value was originally determined when you selected an instal- 
lation language as your Linux version was installed. 

To see the locale settings, use the locale command. 


[me@linuxbox ~]$ locale 
LANG=en_US.UTF-8 
LC_CTYPE="en_US.UTF-8" 
LC_NUMERIC="en_US.UTF-8" 
LC_TIME="en_US.UTF-8" 
LC_COLLATE="en_US.UTF-8" 
LC_MONETARY="en_US.UTF-8" 
LC_MESSAGES="en_US.UTF-8" 
LC_PAPER="en_US.UTF-8" 
LC_NAME="en_US.UTF-8" 
LC_ADDRESS="en_US.UTF-8" 
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8" 
LC_IDENTIFICATION="en_US.UTF-8" 
LC_ALL= 


To change the locale to use the traditional Unix behaviors, set the LANG 
variable to POSIX. 


[me@linuxbox ~]$ export LANG=POSIX 


Note that this change converts the system to use US English (more specifically,
ASCII) for its character set, so be sure that this is really what you want.


You can make this change permanent by adding this line to your .bashrc file: 


export LANG=POSIX 


POSIX Basic vs. Extended Regular Expressions


Just when we thought this couldn’t get any more confusing, we discover that
POSIX also splits regular expression implementations into two kinds: basic
regular expressions (BRE) and extended regular expressions (ERE). The features
we have covered so far are supported by any application that is POSIX
compliant and implements BRE. Our grep program is one such program.

What’s the difference between BRE and ERE? It’s a matter of
metacharacters. With BRE, the following metacharacters are recognized:


^ $ . [ ] *


All other characters are considered literals. With ERE, the following
metacharacters (and their associated functions) are added:


( ) { } ? + |


However (and this is the fun part), the (, ), {, and } characters are treated 
as metacharacters in BRE if they are escaped with a backslash, whereas with 
ERE, preceding any metacharacter with a backslash causes it to be treated as 
a literal. Any weirdness that comes along will be covered in the discussions 
that follow. 

Because the features we are going to discuss next are part of ERE, we 
are going to need to use a different grep. Traditionally, this has been per- 
formed by the egrep program, but the GNU version of grep also supports 
extended regular expressions when the -E option is used. 
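
The reversed role of the backslash can be seen by running one pair of expressions through both flavors. This sketch uses two made-up input lines, aa and a{2}, and borrows the {2} repetition count purely as a convenient test case.

```shell
# Two hypothetical input lines: a doubled letter, and a literal "a{2}".
text=$(printf '%s\n' 'aa' 'a{2}')

# BRE: unescaped braces are literals, so only the line "a{2}" matches.
bre_plain=$(printf '%s\n' "$text" | grep 'a{2}')

# BRE: escaped braces become metacharacters meaning "exactly two a's".
bre_escaped=$(printf '%s\n' "$text" | grep 'a\{2\}')

# ERE: unescaped braces are already metacharacters.
ere_plain=$(printf '%s\n' "$text" | grep -E 'a{2}')

# ERE: escaping the braces turns them back into literals.
ere_escaped=$(printf '%s\n' "$text" | grep -E 'a\{2\}')

printf '%s\n' "$bre_plain" "$bre_escaped" "$ere_plain" "$ere_escaped"
```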


POSIX 


During the 1980s, Unix became a very popular commercial operating system, 
but by 1988, the Unix world was in turmoil. Many computer manufacturers 
had licensed the Unix source code from its creators, AT&T, and were supplying 
various versions of the operating system with their systems. However, in their 
efforts to create product differentiation, each manufacturer added proprietary 
changes and extensions. This started to limit the compatibility of the software. 
As always with proprietary vendors, each was trying to play a winning game 
of “lock in” with their customers. This dark time in the history of Unix is known 
today as the Balkanization. 

Enter the Institute of Electrical and Electronics Engineers (IEEE). In the mid- 
1980s, the IEEE began developing a set of standards that would define how 
Unix (and Unix-like) systems would perform. These standards, formally known 
as IEEE 1003, define the application programming interfaces (APIs), shell, and 
utilities that are to be found on a standard Unix-like system. The name POSIX, 
which stands for Portable Operating System Interface (with the X added to 
the end for extra snappiness), was suggested by Richard Stallman (yes, that 
Richard Stallman) and was adopted by the IEEE. 


Alternation 


The first of the extended regular expression features we will discuss is 
called alternation, which is the facility that allows a match to occur from 
among a set of expressions. Just as a bracket expression allows a single char- 
acter to match from a set of specified characters, alternation allows matches 
from a set of strings or other regular expressions. 

To demonstrate, we’ll use grep in conjunction with echo. First, let’s try a 
plain old string match.


[me@linuxbox ~]$ echo "AAA" | grep AAA
AAA
[me@linuxbox ~]$ echo "BBB" | grep AAA
[me@linuxbox ~]$


This is a pretty straightforward example, in which we pipe the output 
of echo into grep and see the results. When a match occurs, we see it printed 
out; when no match occurs, we see no results. 

Now we'll add alternation, signified by the vertical-bar metacharacter. 


[me@linuxbox ~]$ echo "AAA" | grep -E 'AAA|BBB' 
AAA 

[me@linuxbox ~]$ echo "BBB" | grep -E 'AAA|BBB' 
BBB 

[me@linuxbox ~]$ echo "CCC" | grep -E 'AAA|BBB' 
[me@linuxbox ~]$ 


Here we see the regular expression 'AAA|BBB', which means “match 
either the string AAA or the string BBB.” Notice that since this is an extended 
feature, we added the -E option to grep (though we could have just used the 
egrep program instead), and we enclosed the regular expression in quotes to 
prevent the shell from interpreting the vertical-bar metacharacter as a pipe 
operator. Alternation is not limited to two choices. 


[me@linuxbox ~]$ echo "AAA" | grep -E 'AAA|BBB|CCC' 
AAA 


To combine alternation with other regular expression elements, we can 
use () to separate the alternation. 


[me@linuxbox ~]$ grep -Eh '^(bz|gz|zip)' dirlist*.txt


This expression will match the filenames in our lists that start with
either bz, gz, or zip. Had we left off the parentheses, the meaning of this
regular expression would change to match any filename that begins with bz or
contains gz or contains zip:


[me@linuxbox ~]$ grep -Eh '^bz|gz|zip' dirlist*.txt
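
The effect of the parentheses is easier to see on a short, made-up list of names (an assumption used here in place of the dirlist files).

```shell
# Hypothetical names, one per line.
names=$(printf '%s\n' bzip2 gzip zipinfo gunzip unzip)

# Grouped: the anchor applies to all three alternatives.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')

# Ungrouped: the anchor applies only to bz; gz and zip may occur anywhere.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')

printf '%s\n' "$grouped" "---" "$ungrouped"
```

The grouped form selects only bzip2, gzip, and zipinfo, while the ungrouped form also lets in gunzip and unzip because they contain zip.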
Extended regular expressions support several ways to specify the number of
times an element is matched, as described in the sections that follow. 


?—Match an Element Zero or One Time 


This quantifier means, in effect, “make the preceding element optional.”
Let’s say we wanted to check a phone number for validity and we considered
a phone number to be valid if it matched either of these two forms, where n
is a numeral:


• (nnn) nnn-nnnn
• nnn nnn-nnnn


We could construct a regular expression like this:


^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$


In this expression, we follow the parentheses characters with question
marks to indicate that they are to be matched zero or one time. Again,
because the parentheses are normally metacharacters (in ERE), we precede
them with backslashes to cause them to be treated as literals instead.

Let’s try it.


[me@linuxbox ~]$ echo "(555) 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
(555) 123-4567
[me@linuxbox ~]$ echo "555 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
555 123-4567
[me@linuxbox ~]$ echo "AAA 123-4567" | grep -E '^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
[me@linuxbox ~]$


Here we see that the expression matches both forms of the phone 
number but does not match one containing non-numeric characters. This 
expression is not perfect as it still allows mismatched parentheses around 
the area code, but it will perform the first stage of a verification. 


*—Match an Element Zero or More Times 


Like the ? metacharacter, the * is used to denote an optional item; how- 
ever, unlike the ?, the item may occur any number of times, not just once. 
Let’s say we wanted to see whether a string was a sentence; that is, it starts 
with an uppercase letter, then contains any number of uppercase and lower- 
case letters and spaces, and ends with a period. To match this (crude) defi- 
nition of a sentence, we could use a regular expression like this: 


^[[:upper:]][[:upper:][:lower:] ]*\.


The expression consists of three items: a bracket expression containing
the [:upper:] character class, a bracket expression containing both the
[:upper:] and [:lower:] character classes and a space, and a period escaped
with a backslash. The second element is trailed with an * metacharacter
so that after the leading uppercase letter in our sentence, any number of
uppercase and lowercase letters and spaces may follow it and still match.


[me@linuxbox ~]$ echo "This works." | grep -E '^[[:upper:]][[:upper:][:lower:] ]*\.'
This works.
[me@linuxbox ~]$ echo "This Works." | grep -E '^[[:upper:]][[:upper:][:lower:] ]*\.'
This Works.
[me@linuxbox ~]$ echo "this does not" | grep -E '^[[:upper:]][[:upper:][:lower:] ]*\.'
[me@linuxbox ~]$


The expression matches the first two tests, but not the third, since it
lacks the required leading uppercase character and trailing period.


+—Match an Element One or More Times 


The + metacharacter works much like the *, except it requires at least one 
instance of the preceding element to cause a match. Here is a regular expres- 
sion that will match only the lines consisting of groups of one or more alpha- 
betic characters separated by single spaces: 


^([[:alpha:]]+ ?)+$


Let’s try it.


[me@linuxbox ~]$ echo "This that" | grep -E '^([[:alpha:]]+ ?)+$'
This that
[me@linuxbox ~]$ echo "a b c" | grep -E '^([[:alpha:]]+ ?)+$'
a b c
[me@linuxbox ~]$ echo "a b 9" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$ echo "abc  d" | grep -E '^([[:alpha:]]+ ?)+$'
[me@linuxbox ~]$


We see that this expression does not match the line a b 9 because it
contains a nonalphabetic character; nor does it match abc  d because more
than one space character separates the characters c and d.


{ }—Match an Element a Specific Number of Times 


The { and } metacharacters are used to express minimum and maximum 
numbers of required matches. They may be specified in four possible ways, 
as outlined in Table 19-3. 


Table 19-3: Specifying the Number of Matches

Specifier  Meaning
{n}        Match the preceding element if it occurs exactly n times.
{n,m}      Match the preceding element if it occurs at least n times but no
           more than m times.
{n,}       Match the preceding element if it occurs n or more times.
{,m}       Match the preceding element if it occurs no more than m times.


Going back to our earlier example with the phone numbers, we can 
use this method of specifying repetitions to simplify our original regular 
expression from the following: 


^\(?[0-9][0-9][0-9]\)? [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$


to the following:


^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$


Let’s try it.


[me@linuxbox ~]$ echo "(555) 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
(555) 123-4567
[me@linuxbox ~]$ echo "555 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
555 123-4567
[me@linuxbox ~]$ echo "5555 123-4567" | grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
[me@linuxbox ~]$


As we can see, our revised expression can successfully validate numbers 
both with and without the parentheses, while rejecting those numbers that 
are not properly formatted. 
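
Earlier we noted that making each parenthesis optional on its own still accepts mismatched parentheses. One possible way to tighten the check, offered here as a sketch rather than as the only approach, is to use alternation so that the area code is either fully parenthesized or not parenthesized at all. The variable names and test numbers below are made up for illustration.

```shell
# The expression from the text, and a stricter variant using alternation.
loose='^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'
strict='^(\([0-9]{3}\)|[0-9]{3}) [0-9]{3}-[0-9]{4}$'

# Two well-formed numbers and two with mismatched parentheses.
numbers=$(printf '%s\n' '(555) 123-4567' '555 123-4567' \
                        '(555 123-4567' '555) 123-4567')

# The loose expression accepts all four lines.
loose_count=$(printf '%s\n' "$numbers" | grep -Ec "$loose")

# The strict expression accepts only the two well-formed lines.
strict_matches=$(printf '%s\n' "$numbers" | grep -E "$strict")

printf '%s\n' "$loose_count" "$strict_matches"
```

The unescaped parentheses group the two alternatives, while the escaped ones inside the first alternative match literal parentheses.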


Putting Regular Expressions to Work 


Let’s look at some of the commands we already know and see how they can 
be used with regular expressions. 


Validating a Phone List with grep 


In our earlier example, we looked at single phone numbers and checked 
them for proper formatting. A more realistic scenario would be checking 
a list of numbers instead, so let’s make a list. We'll do this by reciting a 
magical incantation to the command line. It will be magic because we 
have not covered most of the commands involved, but worry not. We will 
get there in future chapters. Here is the incantation. 


[me@linuxbox ~]$ for i in {1..10}; do echo "(${RANDOM:0:3}) ${RANDOM:0:3}-${RANDOM:0:4}" >> phonelist.txt; done


This command will produce a file named phonelist.txt containing 10
phone numbers. Each time the command is repeated, another 10 numbers 
are added to the list. We can also change the value 10 near the beginning of 
the command to produce more or fewer phone numbers. If we examine the 
contents of the file, however, we see we have a problem. 


[me@linuxbox ~]$ cat phonelist.txt
(232) 298-2265
(624) 381-1078
(540) 126-1980
(874) 163-2885
(286) 254-2860
(292) 108-518
(129) 44-1379
(458) 273-1642
(686) 299-8268
(198) 307-2440


Some of the numbers are malformed, which is perfect for our purposes 
because we will use grep to validate them. 

One useful method of validation would be to scan the file for invalid 
numbers and display the resulting list. 


[me@linuxbox ~]$ grep -Ev '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$' phonelist.txt
(292) 108-518
(129) 44-1379
[me@linuxbox ~]$


Here we use the -v option to produce an inverse match so that we will 
output only the lines in the list that do not match the specified expression. 
The expression itself includes the anchor metacharacters at each end to 
ensure that the number has no extra characters at either end. This expres- 
sion also requires that the parentheses be present in a valid number, unlike 
our earlier phone number example. 
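
Since the generated list is random, the following sketch uses a small fixed list instead (four numbers copied from the sample output above, written to a temporary file) so that the results are repeatable. Adding -n shows where each invalid number sits in the file, and -c gives a count suitable for a summary.

```shell
# A fixed four-line list: lines 2 and 3 are malformed.
tmp=$(mktemp)
printf '%s\n' '(232) 298-2265' '(292) 108-518' \
              '(129) 44-1379' '(458) 273-1642' > "$tmp"

pattern='^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$'

# -v selects the non-matching lines; -n adds their line numbers.
invalid=$(grep -Evn "$pattern" "$tmp")

# -c with -v counts the non-matching lines instead of printing them.
invalid_count=$(grep -Evc "$pattern" "$tmp")

printf '%s\n' "$invalid" "$invalid_count"
rm -f "$tmp"
```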


Finding Ugly Filenames with find 


The find command supports a test based on a regular expression. There is an 
important consideration to keep in mind when using regular expressions in 
find versus grep. Whereas grep will print a line when the line contains a string 
that matches an expression, find requires that the pathname exactly match the 
regular expression. In the following example, we will use find with a regular 
expression to find every pathname that contains any character that is not a
member of the following set: 


[-_./0-9a-zA-Z] 


Such a scan would reveal pathnames that contain embedded spaces and 
other potentially offensive characters. 


[me@linuxbox ~]$ find . -regex '.*[^-_./0-9a-zA-Z].*'


Because of the requirement for an exact match of the entire pathname, 
we use .* at both ends of the expression to match zero or more instances 
of any character. In the middle of the expression, we use a negated bracket 
expression containing our set of acceptable pathname characters. 
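
To try this safely, we can build a scratch directory containing a few made-up filenames. The names below are assumptions for illustration, and LC_ALL=C is set here so that the ranges in the bracket expression and the sort order do not depend on the system's locale.

```shell
# A scratch directory with one tidy name and two "ugly" ones.
tmp=$(mktemp -d)
touch "$tmp/good_name.txt" "$tmp/bad name.txt" "$tmp/what?.log"

# The whole pathname must match, hence the .* on both sides of the
# negated bracket expression. Sorting makes the output order predictable.
ugly=$(cd "$tmp" && LC_ALL=C find . -regex '.*[^-_./0-9a-zA-Z].*' | LC_ALL=C sort)

printf '%s\n' "$ugly"
rm -rf "$tmp"
```

The tidy name and the directory itself are not listed, because every character in those pathnames belongs to the acceptable set.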


Searching for Files with locate 


The locate program supports both basic (the --regexp option) and extended 
(the --regex option) regular expressions. With it, we can perform many of 
the same operations that we performed earlier with our dirlist files. 


[me@linuxbox ~]$ locate --regex 'bin/(bz|gz|zip)' 
/bin/bzcat 
/bin/bzcmp 
/bin/bzdiff 
/bin/bzegrep 
/bin/bzexe 
/bin/bzfgrep 
/bin/bzgrep 
/bin/bzip2 
/bin/bzip2recover 
/bin/bzless 
/bin/bzmore 
/bin/gzexe 
/bin/gzip 
/usr/bin/zip 
/usr/bin/zipcloak 
/usr/bin/zipgrep 
/usr/bin/zipinfo 
/usr/bin/zipnote 
/usr/bin/zipsplit 


Using alternation, we perform a search for pathnames that contain 
either bin/bz, bin/gz, or /bin/zip. 
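
Because locate depends on a database that differs from system to system, the following sketch applies the same alternation with grep -E to a fixed list of pathnames instead. The list is made up for illustration:

```shell
# A fixed list of pathnames standing in for the locate database.
paths=$(printf '%s\n' /bin/bzip2 /bin/gzip /usr/bin/zip /bin/tar /usr/bin/unzip)

# The parenthesized alternation matches bz, gz, or zip directly after "bin/".
matches=$(printf '%s\n' "$paths" | grep -E 'bin/(bz|gz|zip)')
printf '%s\n' "$matches"
```

Note that /usr/bin/unzip is not matched, since the characters following bin/ are "un" rather than one of the three alternatives.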


Searching for Text with less and vim 


less and vim both share the same method of searching for text. Pressing the 
/ key followed by a regular expression will perform a search. If we use less 
to view our phonelist.txt file, like so: 


[me@linuxbox ~]$ less phonelist.txt 


and then search for our validation expression, like this: 


(232) 298-2265 
(624) 381-1078 
(540) 126-1980 
(874) 163-2885 
(286) 254-2860 
(292) 108-518 
(129) 44-1379 
(458) 273-1642 
(686) 299-8268 
(198) 307-2440 

/^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$ 


less will highlight the strings that match, leaving the invalid ones easy 
to spot. 


(232) 298-2265 
(624) 381-1078 
(540) 126-1980 
(874) 163-2885 
(286) 254-2860 
(292) 108-518 
(129) 44-1379 


vim, on the other hand, supports basic regular expressions, so our 
search expression would look like this: 


/([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\} 


We can see that the expression is mostly the same; however, many of 
the characters that are considered metacharacters in extended expressions 
are considered literals in basic expressions. They are treated only as meta- 
characters when escaped with a backslash. Depending on the particular 
configuration of vim on our system, the matching will be highlighted. If 
not, try this command mode command to activate search highlighting: 


:set hlsearch 


Depending on your distribution, vim may or may not support text search highlighting. 
Ubuntu, in particular, supplies a stripped-down version of vim by default. On such 
systems, you may want to use your package manager to install a more complete 
version of vim. 
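
The difference between the basic and extended forms of the validation expression can also be checked outside an editor. This sketch uses grep, which accepts both syntaxes, on a single sample number; it is an illustration of the escaping rules rather than of vim itself:

```shell
number='(232) 298-2265'

# Extended syntax: ( ) and { } are metacharacters, so the literal
# parentheses are escaped and the braces are not.
ere=$(printf '%s\n' "$number" | grep -E '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$')

# Basic syntax: ( ) and { } are literals unless escaped, so the
# escaping is reversed.
bre=$(printf '%s\n' "$number" | grep '^([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\}$')

printf 'ERE matched: %s\nBRE matched: %s\n' "$ere" "$bre"
```

Both commands match the same line; only the placement of the backslashes differs.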


Summing Up 


In this chapter, we saw a few of the many uses of regular expressions. We 
can find even more if we use regular expressions to search for additional 
applications that use them. We can do that by searching the man pages. 


[me@linuxbox ~]$ cd /usr/share/man/man1 
[me@linuxbox man1]$ zgrep -El 'regex|regular expression' *.gz 


The zgrep program provides a front end for grep, allowing it to read 
compressed files. In our example, we search the compressed section 1 man 
page files in their usual location. The result of this command is a list of files 
containing either the string regex or the string regular expression. As we can 
see, regular expressions show up in a lot of programs. 

There is one feature found in basic regular expressions that we did 
not cover. Called back references, this feature will be discussed in the next 
chapter. 


TEXT PROCESSING 


All Unix-like operating systems rely heavily on text files for data storage. 
So it makes sense that there are many tools for manipulating text. In this 
chapter, we will look at programs that are used to “slice and dice” text. In 
the next chapter, we will look at more text processing, focusing on programs 
that are used to format text for printing and other kinds of human consumption. 


This chapter will revisit some old friends and introduce us to some 
new ones: 

cat      Concatenate files and print on the standard output 
sort     Sort lines of text files 
uniq     Report or omit repeated lines 
cut      Remove sections from each line of files 
paste    Merge lines of files 
join     Join lines of two files on a common field 
comm     Compare two sorted files line by line 
diff     Compare files line by line 
patch    Apply a diff file to an original 
tr       Translate or delete characters 
sed      Stream editor for filtering and transforming text 
aspell   Interactive spellchecker 


Applications of Text 


So far, we have learned about a couple of text editors (nano and vim), looked 
at a bunch of configuration files, and witnessed the output of dozens of 
commands, all in text. But what else is text used for? For many things, it 
turns out. 


Documents 


Many people write documents using plain text formats. While it is easy to 
see how a small text file could be useful for keeping simple notes, it is also 
possible to write large documents in text format. One popular approach 
is to write a large document in a text format and then embed a markup 
language to describe the formatting of the finished document. Many scien- 
tific papers are written using this method, as Unix-based text processing 
systems were among the first systems that supported the advanced typo- 
graphical layout needed by writers in technical disciplines. 


Web Pages 


The world’s most popular type of electronic document is probably the 
web page. Web pages are text documents that use either Hypertext Markup 
Language (HTML) or Extensible Markup Language (XML) as markup lan- 
guages to describe the document’s visual format. 


Email 


Email is an intrinsically text-based medium. Even non-text attachments 
are converted into a text representation for transmission. We can see this 
for ourselves by downloading an email message and then viewing it in less. 
We will see that the message begins with a header that describes the source 
of the message and the processing it received during its journey, followed 
by the body of the message with its content. 


Printer Output 


On Unix-like systems, output destined for a printer is sent as plain text or, 
if the page contains graphics, is converted into a text format page description 
language known as PostScript, which is then sent to a program that generates 
the graphic dots to be printed. 


Program Source Code 


Many of the command line programs found on Unix-like systems were cre- 
ated to support system administration and software development, and text 
processing programs are no exception. Many of them are designed to solve 
software development problems. The reason text processing is important 
to software developers is that all software starts out as text. Source code, 
the part of the program the programmer actually writes, is always in text 
format. 


Revisiting Some Old Friends 


Back in Chapter 6 we learned about some commands that are able to accept 
standard input in addition to command line arguments. We touched on 
them only briefly then, but now we will take a closer look at how they can 
be used to perform text processing. 


cat 


The cat program has a number of interesting options. Many of them are 
used to help better visualize text content. One example is the -A option, 
which is used to display non-printing characters in the text. There are 
times when we want to know whether control characters are embedded in 
our otherwise visible text. The most common of these are tab characters 
(as opposed to spaces) and carriage returns, often present as end-of-line 
characters in MS-DOS-style text files. Another common situation is a file 
containing lines of text with trailing spaces. 

Let’s create a test file using cat as a primitive word processor. To do 
this, we’ll just enter the command cat (along with specifying a file for redi- 
rected output) and type our text, followed by ENTER to properly end the line 
and then CTRL-D to indicate to cat that we have reached end of file. In this 
example, we enter a leading tab character and follow the line with some 
trailing spaces: 


[me@linuxbox ~]$ cat > foo.txt 
The quick brown fox jumped over the lazy dog. 
[me@linuxbox ~]$ 


Next, we use cat with the -A option to display the text. 


[me@linuxbox ~]$ cat -A foo.txt 
^IThe quick brown fox jumped over the lazy dog.       $ 
[me@linuxbox ~]$ 


As we can see in the results, the tab character in our text is represented 
by ^I. This is a common notation that means CTRL-I, which, as it turns out, 
is the same as a tab character. We also see that a $ appears at the true end 
of the line, indicating that our text contains trailing spaces. 


MS-DOS TEXT VS. UNIX TEXT 


One of the reasons you may want to use cat to look for non-printing characters 
in text is to spot hidden carriage returns. Where do hidden carriage returns 
come from? DOS and Windows! Unix and DOS don’t define the end of a line 
the same way in text files. Unix ends a line with a linefeed character (ASCII 10), 
while MS-DOS and its derivatives use the sequence carriage return (ASCII 13) 
and linefeed to terminate each line of text. 

There are several ways to convert files from DOS to Unix format. On many 
Linux systems, there are programs called dos2unix and unix2dos, which can con- 
vert text files to and from DOS format. However, if you don't have dos2unix on 
your system, don't worry. The process of converting text from DOS to Unix format 
is simple; it involves the removal of the offending carriage returns. That is easily 
accomplished by a couple of the programs discussed later in this chapter. 
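
As a rough sketch of that conversion, the following uses tr (one of the programs covered later in this chapter) to delete the carriage returns, with GNU cat -A showing the line endings before and after. The two-line sample file is made up for illustration:

```shell
# Create a two-line file with DOS-style CR+LF line endings.
dosfile=$(mktemp)
printf 'line one\r\nline two\r\n' > "$dosfile"

# cat -A shows each carriage return as ^M just before the $ end-of-line mark.
before=$(cat -A "$dosfile")

# tr -d deletes every carriage return, leaving Unix-style linefeeds.
after=$(tr -d '\r' < "$dosfile" | cat -A)

printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
rm -f "$dosfile"
```

In the "before" output each line ends in ^M$, while in the "after" output only the $ remains.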


cat also has options that are used to modify text. The two most promi- 
nent are -n, which numbers lines, and -s, which suppresses the output of 
multiple blank lines. We can demonstrate thusly: 


[me@linuxbox ~]$ cat > foo.txt 
The quick brown fox 


jumped over the lazy dog. 
[me@linuxbox ~]$ cat -ns foo.txt 
     1  The quick brown fox 
     2 
     3  jumped over the lazy dog. 
[me@linuxbox ~]$ 


In this example, we create a new version of our foo.txt test file, which 
contains two lines of text separated by two blank lines. After processing by 
cat with the -ns options, the extra blank line is removed, and the remaining 
lines are numbered. While this is not much of a process to perform on text, 
it is a process. 


sort 


The sort program sorts the contents of standard input, or one or more files 
specified on the command line, and sends the results to standard output. 
Using the same technique that we used with cat, we can demonstrate pro- 
cessing of standard input directly from the keyboard as follows: 


[me@linuxbox ~]$ sort > foo.txt 
c 
b 
a 
[me@linuxbox ~]$ cat foo.txt 
a 
b 
c 


After entering the command, we enter the letters c, b, and a, and then 
we press CTRL-D to indicate end of file. We then view the resulting file and 
see that the lines now appear in sorted order. 

Because sort can accept multiple files on the command line as argu- 
ments, it is possible to merge multiple files into a single sorted whole. For 
example, if we had three text files and wanted to combine them into a 
single sorted file, we could do something like this: 


sort file1.txt file2.txt file3.txt > final_sorted_list.txt 
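
Here is a self-contained sketch of the same command. The file contents are made up for illustration, and LC_ALL=C is an added assumption that pins the collation order so the result does not depend on the system locale:

```shell
# Three small unsorted files in a scratch directory (contents are illustrative).
dir=$(mktemp -d)
printf 'pear\napple\n'  > "$dir/file1.txt"
printf 'fig\n'          > "$dir/file2.txt"
printf 'kiwi\nbanana\n' > "$dir/file3.txt"

# sort reads all three files and writes a single sorted result.
LC_ALL=C sort "$dir/file1.txt" "$dir/file2.txt" "$dir/file3.txt" \
  > "$dir/final_sorted_list.txt"
merged=$(cat "$dir/final_sorted_list.txt")
printf '%s\n' "$merged"

rm -rf "$dir"
```

The lines from all three files appear in one alphabetical list, regardless of which file they came from.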


sort has several interesting options. Table 20-1 provides a partial list. 


Table 20-1: Common sort Options 

Option  Long option               Description 
-b      --ignore-leading-blanks   By default, sorting is performed on the entire 
                                  line, starting with the first character in the 
                                  line. This option causes sort to ignore leading 
                                  spaces in lines and calculate sorting based on 
                                  the first non-whitespace character on the line. 
-f      --ignore-case             Make sorting case-insensitive. 
-n      --numeric-sort            Perform sorting based on the numeric evaluation 
                                  of a string. Using this option allows sorting 
                                  to be performed on numeric values rather than 
                                  alphabetic values. 
-r      --reverse                 Sort in reverse order. Results are in descending 
                                  rather than ascending order. 
-k      --key=field1[,field2]     Sort based on a key field located from field1 
                                  to field2 rather than the entire line. See the 
                                  following discussion. 
-m      --merge                   Treat each argument as the name of a presorted 
                                  file. Merge multiple files into a single sorted 
                                  result without performing any additional sorting. 
-o      --output=file             Send sorted output to file rather than standard 
                                  output. 
-t      --field-separator=char    Define the field-separator character. By default 
                                  fields are separated by spaces or tabs. 


Although most of these options are pretty self-explanatory, some are not. 
First, let’s look at the -n option, used for numeric sorting. With this option, 
it is possible to sort values based on numeric values rather than 
lexicographically. We can demonstrate this by sorting the results of the du 
command to determine the largest users of disk space. Normally, the du command 
lists the results of a summary in pathname order. 


[me@linuxbox ~]$ du -s /usr/share/* | head 
252     /usr/share/aclocal 
96      /usr/share/acpi-support 
8       /usr/share/adduser 
196     /usr/share/alacarte 
344     /usr/share/alsa 
8       /usr/share/alsa-base 
12488   /usr/share/anthy 
8       /usr/share/apmd 
21440   /usr/share/app-install 
48      /usr/share/application-registry 


In this example, we pipe the results into head to limit the results to the 
first 10 lines. We can produce a numerically sorted list to show the 10 largest 
consumers of space this way. 


[me@linuxbox ~]$ du -s /usr/share/* | sort -nr | head 
509940  /usr/share/locale-langpack 
242660  /usr/share/doc 
197560  /usr/share/fonts 
179144  /usr/share/gnome 
146764  /usr/share/myspell 
144304  /usr/share/gimp 
135880  /usr/share/dict 
76508   /usr/share/icons 
68072   /usr/share/apps 
62844   /usr/share/foomatic 


By using the n and r options, we produce a reverse numerical sort, with 
the largest values appearing first in the results. This sort works because the 
numerical values occur at the beginning of each line. But what if we want to 
sort a list based on some value found within the line? For example, here are 
the results of ls -l: 


[me@linuxbox ~]$ ls -l /usr/bin | head 
total 152948 
-rwxr-xr-x 1 root root  34824 2016-04-04 02:42 [ 
-rwxr-xr-x 1 root root 101556 2007-11-27 06:08 a2p 
-rwxr-xr-x 1 root root  13036 2016-02-27 08:22 aconnect 
-rwxr-xr-x 1 root root  10552 2007-08-15 10:34 acpi 
-rwxr-xr-x 1 root root   3800 2016-04-14 03:51 acpi_fakekey 
-rwxr-xr-x 1 root root   7536 2016-04-19 00:19 acpi_listen 
-rwxr-xr-x 1 root root   3576 2016-04-29 07:57 addpart 
-rwxr-xr-x 1 root root  20808 2016-01-03 18:02 addr2line 
-rwxr-xr-x 1 root root 489704 2016-10-09 17:02 adept_batch 


Ignoring, for the moment, that ls can sort its results by size, we could 
use sort to sort this list by file size, as well. 


[me@linuxbox ~]$ ls -l /usr/bin | sort -nrk 5 | head 
-rwxr-xr-x 1 root root 8234216 2016-04-07 17:42 inkscape 
-rwxr-xr-x 1 root root 8222692 2016-04-07 17:42 inkview 
-rwxr-xr-x 1 root root 3746508 2016-03-07 23:45 gimp-2.4 
-rwxr-xr-x 1 root root 3654020 2016-08-26 16:16 quanta 
-rwxr-xr-x 1 root root 2928760 2016-09-10 14:31 gdbtui 
-rwxr-xr-x 1 root root 2928756 2016-09-10 14:31 gdb 
-rwxr-xr-x 1 root root 2602236 2016-10-10 12:56 net 
-rwxr-xr-x 1 root root 2304684 2016-10-10 12:56 rpcclient 
-rwxr-xr-x 1 root root 2241832 2016-04-04 05:56 aptitude 
-rwxr-xr-x 1 root root 2202476 2016-10-10 12:56 smbcacls 


Many uses of sort involve the processing of tabular data, such as the 
results of the previous ls command. If we apply database terminology to 
the previous table, we would say that each row is a record and that each 
record consists of multiple fields, such as the file attributes, link count, 
filename, file size, and so on. sort is able to process individual fields. In 
database terms, we are able to specify one or more key fields to use as sort 
keys. In the previous example, we specify the n and r options to perform a 
reverse numerical sort and specify -k 5 to make sort use the fifth field as 
the key for sorting. 
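
Since the contents of /usr/bin vary between systems, here is a reproducible sketch of the same key sort on three made-up lines in ls -l layout. The filenames small, medium, and large are hypothetical, and sed (covered later in this chapter) is used only to trim the display down to the filename:

```shell
# Three made-up lines in the same layout as ls -l output.
listing=$(printf '%s\n' \
  '-rwxr-xr-x 1 root root   3800 2016-04-14 03:51 small' \
  '-rwxr-xr-x 1 root root 489704 2016-10-09 17:02 large' \
  '-rwxr-xr-x 1 root root  20808 2016-01-03 18:02 medium')

# -k 5 starts the key at the fifth field (the size); -n compares it as a
# number and -r puts the largest first. sed then strips everything up to
# the last space, keeping only the filename.
by_size=$(printf '%s\n' "$listing" | sort -nrk 5 | sed 's/.* //')
printf '%s\n' "$by_size"
```

The names come out in descending order of the size field, not in the order they were entered.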

The k option is interesting and has many features, but first we need to 
talk about how sort defines fields. Let’s consider the following simple text 
file consisting of a single line containing the author’s name: 


William Shotts 


By default, sort sees this line as having two fields. The first field 
contains these characters: "William". The second field contains these 
characters: " Shotts", including the leading space. 

This means that whitespace characters (spaces and tabs) are used as 
delimiters between fields and that the delimiters are included in the field 
when sorting is performed. 
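
The effect of those included delimiters can be seen in a small experiment. This sketch uses two made-up lines whose second fields are preceded by different amounts of space, and assumes the C locale so that characters compare by their byte values:

```shell
# Two lines whose second fields are preceded by different amounts of space.
lines=$(printf '%s\n' 'a   z' 'b y')

# Without b, the key for field 2 includes its leading blanks, so "   z"
# is compared against " y" and the line with more blanks sorts first.
with_blanks=$(printf '%s\n' "$lines" | LC_ALL=C sort -k 2)

# With b, leading blanks are skipped and the comparison is "z" against "y".
without_blanks=$(printf '%s\n' "$lines" | LC_ALL=C sort -k 2b)

printf 'default:\n%s\nwith b:\n%s\n' "$with_blanks" "$without_blanks"
```

The two runs produce opposite orders, which is why the b option matters whenever the amount of whitespace between fields varies from line to line.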

Looking again at a line from our ls output, as follows, we can see that a 
line contains eight fields and that the fifth field is the file size: 


-rwxr-xr-x 1 root root 8234216 2016-04-07 17:42 inkscape 


For our next series of experiments, let’s consider the following file con- 
taining the history of three popular Linux distributions released from 2006 
to 2008. Each line in the file has three fields: the distribution name, version 
number, and date of release in MM/DD/YYYY format. 


SUSE    10.2    12/07/2006 
Fedora  10      11/25/2008 
SUSE    11.0    06/19/2008 
Ubuntu  8.04    04/24/2008 
Fedora  8       11/08/2007 
SUSE    10.3    10/04/2007 
Ubuntu  6.10    10/26/2006 
Fedora  7       05/31/2007 
Ubuntu  7.10    10/18/2007 
Ubuntu  7.04    04/19/2007 
SUSE    10.1    05/11/2006 
Fedora  6       10/24/2006 
Fedora  9       05/13/2008 
Ubuntu  6.06    06/01/2006 
Ubuntu  8.10    10/30/2008 
Fedora  5       03/20/2006 

Using a text editor (perhaps vim), we’ll enter this data and name the 
resulting file distros.txt. Next, we’ll try sorting the file and observe 
these results: 


[me@linuxbox ~]$ sort distros.txt 


Fedora 10 11/25/2008 
Fedora 5 03/20/2006 
Fedora 6 10/24/2006 
Fedora 7 05/31/2007 
Fedora 8 11/08/2007 
Fedora 9 05/13/2008 
SUSE 10.1 05/11/2006 
SUSE 10.2 12/07/2006 
SUSE 10.3 10/04/2007 
SUSE 11.0 06/19/2008 
Ubuntu 6.06 06/01/2006 
Ubuntu 6.10 10/26/2006 
Ubuntu 7.04 04/19/2007 
Ubuntu 7.10 10/18/2007 
Ubuntu 8.04 04/24/2008 
Ubuntu 8.10 10/30/2008 


Well, it mostly worked. The problem occurs in the sorting of the Fedora 
version numbers. Because 1 comes before 5 in the character set, version 10 
ends up at the top while version 9 falls to the bottom. 

To fix this problem, we are going to have to sort on multiple keys. We 
want to perform an alphabetic sort on the first field and then a numeric 
sort on the second field. sort allows multiple instances of the -k option so 
that multiple sort keys can be specified. In fact, a key may include a range 
of fields. If no range is specified (as has been the case with our previous 
examples), sort uses a key that begins with the specified field and extends 
to the end of the line. Here is the syntax for our multikey sort: 


[me@linuxbox ~]$ sort --key=1,1 --key=2n distros.txt 


Fedora 5 03/20/2006 
Fedora 6 10/24/2006 
Fedora 7 05/31/2007 
Fedora 8 11/08/2007 
Fedora 9 05/13/2008 
Fedora 10 11/25/2008 
SUSE 10.1 05/11/2006 
SUSE 10.2 12/07/2006 
SUSE 10.3 10/04/2007 
SUSE 11.0 06/19/2008 
Ubuntu 6.06 06/01/2006 
Ubuntu 6.10 10/26/2006 
Ubuntu 7.04 04/19/2007 
Ubuntu 7.10 10/18/2007 
Ubuntu 8.04 04/24/2008 
Ubuntu 8.10 10/30/2008 


Though we used the long form of the option for clarity, -k 1,1 -k 2n 
would be exactly equivalent. In the first instance of the key option, we speci- 
fied a range of fields to include in the first key. Because we wanted to limit 
the sort to just the first field, we specified 1,1, which means “start at field 
1 and end at field 1.” In the second instance, we specified 2n, which means 
field 2 is the sort key and that the sort should be numeric. An option letter 
may be included at the end of a key specifier to indicate the type of sort to 
be performed. These option letters are the same as the global options for 
the sort program: b (ignore leading blanks), n (numeric sort), r (reverse 
sort), and so on. 
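
A shortened version of the experiment, using four tab-separated lines taken from the distros list, can be run as a sketch. LC_ALL=C is an added assumption so that the alphabetic key behaves the same on every system:

```shell
# A few lines from the distros list, tab-separated as in distros.txt.
sample=$(printf '%s\t%s\t%s\n' \
  Fedora 10   11/25/2008 \
  Ubuntu 8.04 04/24/2008 \
  Fedora 9    05/13/2008 \
  Fedora 5    03/20/2006)

# Key 1 is limited to the first field; key 2 starts at field 2 and is numeric.
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort --key=1,1 --key=2n)
printf '%s\n' "$sorted"
```

With the numeric second key, Fedora 10 now follows Fedora 9 instead of preceding Fedora 5.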

The third field in our list contains a date in an inconvenient format for 
sorting. On computers, dates are usually formatted in YYYY-MM-DD order 
to make chronological sorting easy, but ours are in the American format of 
MM/DD/YYYY. How can we sort this list in chronological order? 

Fortunately, sort provides a way. The key option allows specification of 
offsets within fields, so we can define keys within fields. 


[me@linuxbox ~]$ sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt 


Fedora 10 11/25/2008 
Ubuntu 8.10 10/30/2008 
SUSE 11.0 06/19/2008 
Fedora 9 05/13/2008 
Ubuntu 8.04 04/24/2008 
Fedora 8 11/08/2007 
Ubuntu 7.10 10/18/2007 
SUSE 10.3 10/04/2007 
Fedora 7 05/31/2007 
Ubuntu 7.04 04/19/2007 
SUSE 10.2 12/07/2006 
Ubuntu 6.10 10/26/2006 
Fedora 6 10/24/2006 
Ubuntu 6.06 06/01/2006 
SUSE 10.1 05/11/2006 
Fedora 5 03/20/2006 


By specifying -k 3.7, we instruct sort to use a sort key that begins at the 
seventh character within the third field, which corresponds to the start of 
the year. Likewise, we specify -k 3.1 and -k 3.4 to isolate the month and day 
portions of the date. We also add the n and r options to achieve a reverse 
numeric sort. The b option is included to suppress the leading spaces (whose 
numbers vary from line to line, thereby affecting the outcome of the sort) 
in the date field. 

Some files don’t use tabs and spaces as field delimiters; for example, 
here’s the /etc/passwd file: 


[me@linuxbox ~]$ head /etc/passwd 
root:x:0:0:root:/root:/bin/bash 
daemon:x:1:1:daemon:/usr/sbin:/bin/sh 
bin:x:2:2:bin:/bin:/bin/sh 
sys:x:3:3:sys:/dev:/bin/sh 
sync:x:4:65534:sync:/bin:/bin/sync 
games:x:5:60:games:/usr/games:/bin/sh 
man:x:6:12:man:/var/cache/man:/bin/sh 
lp:x:7:7:lp:/var/spool/lpd:/bin/sh 
mail:x:8:8:mail:/var/mail:/bin/sh 
news:x:9:9:news:/var/spool/news:/bin/sh 


The fields in this file are delimited with colons (:), so how would we 
sort this file using a key field? sort provides the -t option to define the 
field separator character. To sort the passwd file on the seventh field (the 
account’s default shell), we could do this: 


[me@linuxbox ~]$ sort -t ':' -k 7 /etc/passwd | head 
me:x:1001:1001:Myself,,,:/home/me:/bin/bash 
root:x:0:0:root:/root:/bin/bash 
dhcp:x:101:102::/nonexistent:/bin/false 
gdm:x:106:114:Gnome Display Manager:/var/lib/gdm:/bin/false 
hplip:x:104:7:HPLIP system user,,,:/var/run/hplip:/bin/false 
klog:x:103:104::/home/klog:/bin/false 
messagebus:x:108:119::/var/run/dbus:/bin/false 
polkituser:x:110:122:PolicyKit,,,:/var/run/PolicyKit:/bin/false 
pulse:x:107:116:PulseAudio daemon,,,:/var/run/pulse:/bin/false 


By specifying the colon character as the field separator, we can sort on 
the seventh field. 
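
The contents of /etc/passwd differ on every machine, so this sketch applies the same options to three made-up records in the same layout. The account names are hypothetical:

```shell
# Made-up records in /etc/passwd layout; the real file is left alone.
records=$(printf '%s\n' \
  'alice:x:1001:1001::/home/alice:/bin/zsh' \
  'bob:x:1002:1002::/home/bob:/bin/bash' \
  'carol:x:1003:1003::/home/carol:/bin/false')

# -t ':' makes the colon the field separator; -k 7 sorts on the shell field.
by_shell=$(printf '%s\n' "$records" | LC_ALL=C sort -t ':' -k 7)
printf '%s\n' "$by_shell"
```

The records are reordered by the shell in the last field: /bin/bash, then /bin/false, then /bin/zsh.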


uniq 

Compared to sort, the uniq program is lightweight. uniq performs a seem- 
ingly trivial task. When given a sorted file (or standard input), it removes 
any duplicate lines and sends the results to standard output. It is often used 
in conjunction with sort to clean the output of duplicates. 


TIP While uniq is a traditional Unix tool often used with sort, the GNU version of sort 
supports a -u option, which removes duplicates from the sorted output. 


Let’s make a text file to try this, as shown here: 


[me@linuxbox ~]$ cat > foo.txt 
a 
b 
c 
a 
b 
c 


Remember to type CTRL-D to terminate standard input. Now, if we run 
uniq on our text file, we get this: 


[me@linuxbox ~]$ uniq foo.txt 
a 
b 
c 
a 
b 
c 


The results are no different from our original file; the duplicates were 
not removed. For uniq to do its job, the input must be sorted first. 


[me@linuxbox ~]$ sort foo.txt | uniq 
a 
b 
c 


This is because uniq only removes duplicate lines that are adjacent to 
each other. 
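
The adjacency rule can be verified without typing the file by hand. The following sketch feeds the six lines a, b, c, a, b, c through uniq alone, through sort and uniq, and through sort -u as mentioned in the tip:

```shell
# Six lines: a, b, c repeated twice, as in the foo.txt example.
letters=$(printf '%s\n' a b c a b c)

# No two identical lines are adjacent, so uniq removes nothing.
unsorted=$(printf '%s\n' "$letters" | uniq)

# Sorting first brings the duplicates together, and uniq can drop them.
deduped=$(printf '%s\n' "$letters" | sort | uniq)

# sort -u gives the same result in one step.
sort_u=$(printf '%s\n' "$letters" | sort -u)

printf '%s\n' "$deduped"
```

The first result is identical to the input, while the second and third each contain a, b, and c exactly once.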
uniq has several options. Table 20-2 lists the common ones. 


Table 20-2: Common uniq Options 

Option   Long option       Description 
-c       --count           Output a list of duplicate lines preceded by the 
                           number of times the line occurs. 
-d       --repeated        Output only repeated lines, rather than unique lines. 
-f n     --skip-fields=n   Ignore n leading fields in each line. Fields are 
                           separated by whitespace as they are in sort; 
                           however, unlike sort, uniq has no option for 
                           setting an alternate field separator. 
-i       --ignore-case     Ignore case during the line comparisons. 
-s n     --skip-chars=n    Skip (ignore) the leading n characters of each line. 
-u       --unique          Output only unique lines. Lines with duplicates are 
                           ignored. 


Here we see uniq used to report the number of duplicates found in our 
text file, using the -c option: 


[me@linuxbox ~]$ sort foo.txt | uniq -c 
      2 a 
      2 b 
      2 c 


The next three programs we will discuss are used to peel columns of text 
out of files and recombine them in useful ways. 


cut—Remove Sections from Each Line of Files 


The cut program is used to extract a section of text from a line and output 
the extracted section to standard output. It can accept multiple file argu- 
ments or input from standard input. 

Specifying the section of the line to be extracted is somewhat awkward 
and is specified using the options listed in Table 20-3. 


Table 20-3: cut Selection Options 

Option     Long option          Description 
-c list    --characters=list    Extract the portion of the line defined by list. 
                                The list may consist of one or more 
                                comma-separated numerical ranges. 
-f list    --fields=list        Extract one or more fields from the line as 
                                defined by list. The list may contain one or 
                                more fields or field ranges separated by commas. 
-d delim   --delimiter=delim    When -f is specified, use delim as the field 
                                delimiting character. By default, fields must 
                                be separated by a single tab character. 
           --complement         Extract the entire line of text, except for 
                                those portions specified by -c and/or -f. 


As we can see, the way cut extracts text is rather inflexible. cut is best 
used to extract text from files that are produced by other programs, rather 
than text directly typed by humans. We’ll take a look at our distros.txt file to 
see whether it is “clean” enough to be a good specimen for our cut examples. 
If we use cat with the -A option, we can see whether the file meets our 
requirements of tab-separated fields. 


[me@linuxbox ~]$ cat -A distros.txt 
SUSE^I10.2^I12/07/2006$ 
Fedora^I10^I11/25/2008$ 
SUSE^I11.0^I06/19/2008$ 
Ubuntu^I8.04^I04/24/2008$ 
Fedora^I8^I11/08/2007$ 
SUSE^I10.3^I10/04/2007$ 
Ubuntu^I6.10^I10/26/2006$ 
Fedora^I7^I05/31/2007$ 
Ubuntu^I7.10^I10/18/2007$ 
Ubuntu^I7.04^I04/19/2007$ 
SUSE^I10.1^I05/11/2006$ 
Fedora^I6^I10/24/2006$ 
Fedora^I9^I05/13/2008$ 
Ubuntu^I6.06^I06/01/2006$ 
Ubuntu^I8.10^I10/30/2008$ 
Fedora^I5^I03/20/2006$ 


It looks good. There are no embedded spaces, just single tab characters 
between the fields. Because the file uses tabs rather than spaces, we’ll use 
the -f option to extract a field. 


[me@linuxbox ~]$ cut -f 3 distros.txt 
12/07/2006 
11/25/2008 
06/19/2008 
04/24/2008 
11/08/2007 
10/04/2007 
10/26/2006 
05/31/2007 
10/18/2007 
04/19/2007 
05/11/2006 
10/24/2006 
05/13/2008 
06/01/2006 
10/30/2008 
03/20/2006 


Because our distros file is tab-delimited, it is best to use cut to extract 
fields rather than characters. This is because when a file is tab-delimited, it 
is unlikely that each line will contain the same number of characters, which 
makes calculating character positions within the line difficult or impossible. 
In our previous example, however, we now have extracted a field that luckily 
contains data of identical length, so we can show how character extraction 
works by extracting the year from each line. 


[me@linuxbox ~]$ cut -f 3 distros.txt | cut -c 7-10 
2006 
2008 
2008 
2008 
2007 
2007 
2006 
2007 
2007 
2007 
2006 
2006 
2008 
2006 
2008 
2006 


By running cut a second time on our list, we are able to extract character 
positions 7 through 10, which correspond to the year in our date field. The 
7-10 notation is an example of a range. The cut man page contains a complete 
description of how ranges can be specified. 

When working with fields, it is possible to specify a different field delim- 
iter rather than the tab character. Here we will extract the first field from 
the /etc/passwd file: 


[me@linuxbox ~]$ cut -d ':' -f 1 /etc/passwd | head 
root 
daemon 
bin 
sys 
sync 
games 
man 
lp 
mail 
news 


Using the -d option, we are able to specify the colon character as the 
field delimiter. 
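
The three selection styles can be compared side by side in a short sketch. It uses one tab-separated record from the distros list and one colon-separated record in /etc/passwd layout as sample input:

```shell
# One tab-separated record and one colon-separated record (sample data).
tabbed=$(printf 'SUSE\t10.2\t12/07/2006')
colons='root:x:0:0:root:/root:/bin/bash'

# -f 3 picks the third tab-delimited field.
date_field=$(printf '%s\n' "$tabbed" | cut -f 3)

# -c 7-10 picks character positions 7 through 10 of that field.
year=$(printf '%s\n' "$date_field" | cut -c 7-10)

# -d ':' switches the field delimiter from tab to colon.
login=$(printf '%s\n' "$colons" | cut -d ':' -f 1)

printf '%s %s %s\n' "$date_field" "$year" "$login"
```

Each call extracts a different slice: the whole date field, the year within it, and the login name from the colon-delimited record.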


EXPANDING TABS 


Our distros.txt file is ideally formatted for extracting fields using cut. But what if 
we wanted a file that could be fully manipulated with cut by characters, rather 
than fields? This would require us to replace the tab characters within the file with 
the corresponding number of spaces. Fortunately, the GNU Coreutils package 
includes a tool for that. Named expand, this program accepts either one or more 
file arguments or standard input and outputs the modified text to standard output. 

If we process our distros.txt file with expand, we can use cut -c to extract 
any range of characters from the file. For example, we could use the following 
command to extract the year of release from our list by expanding the file and 
using cut to extract every character from the 23rd position to the end of the line: 


[me@linuxbox ~]$ expand distros.txt | cut -c 23- 


Coreutils also provides the unexpand program to substitute tabs for spaces. 
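
To show why position 23 is the right starting point, here is a sketch on two tab-separated lines from the distros list. It relies on expand's default tab stops of every eight columns:

```shell
# Two tab-separated lines from the distros list.
sample=$(printf '%s\t%s\t%s\n' SUSE 10.2 12/07/2006 Ubuntu 8.04 04/24/2008)

# expand replaces each tab with spaces up to the next tab stop (every
# 8 columns by default), so the date starts at column 17 and the year
# at column 23 on both lines.
years=$(printf '%s\n' "$sample" | expand | cut -c 23-)
printf '%s\n' "$years"
```

Because both the name and the version are shorter than eight characters, the date begins in the same column on every line once the tabs are expanded.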


paste—Merge Lines of Files 


The paste command does the opposite of cut. Rather than extracting a 
column of text from a file, it adds one or more columns of text to a file. 
It does this by reading multiple files and combining the fields found in 
each file into a single stream on standard output. Like cut, paste accepts 
multiple file arguments and/or standard input. To demonstrate how paste 
operates, we will perform some surgery on our distros.txt file to produce a 
chronological list of releases. 

From our earlier work with sort, we will first produce a list of distros 
sorted by date and store the result in a file called distros-by-date.txt. 


[me@linuxbox ~]$ sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt > distros-by-date.txt 


Next, we will use cut to extract the first two fields from the file (the distro 
name and version) and store that result in a file named distros-versions.txt. 


[me@linuxbox ~]$ cut -f 1,2 distros-by-date.txt > distros-versions.txt 
[me@linuxbox ~]$ head distros-versions.txt 
Fedora 10 
Ubuntu 8.10 
SUSE 11.0 
Fedora 9 
Ubuntu 8.04 
Fedora 8 
Ubuntu 7.10 
SUSE 10.3 
Fedora 7 
Ubuntu 7.04 


The final piece of preparation is to extract the release dates and store 
them in a file named distros-dates.txt. 


[me@linuxbox ~]$ cut -f 3 distros-by-date.txt > distros-dates.txt 
[me@linuxbox ~]$ head distros-dates.txt 
11/25/2008 
10/30/2008 
06/19/2008 
05/13/2008 
04/24/2008 
11/08/2007 
10/18/2007 
10/04/2007 
05/31/2007 
04/19/2007 


We now have the parts we need. To complete the process, use paste to 
put the column of dates ahead of the distro names and versions, thus creat- 
ing a chronological list. This is done simply by using paste and ordering its 
arguments in the desired arrangement. 


[me@linuxbox ~]$ paste distros-dates.txt distros-versions.txt 
11/25/2008 Fedora 10 
10/30/2008 Ubuntu 8.10 
06/19/2008 SUSE 11.0 
05/13/2008 Fedora 9 
04/24/2008 Ubuntu 8.04 
11/08/2007 Fedora 8 
10/18/2007 Ubuntu 7.10 
10/04/2007 SUSE 10.3 
05/31/2007 Fedora 7 
04/19/2007 Ubuntu 7.04 
12/07/2006 SUSE 10.2 
10/26/2006 Ubuntu 6.10 
10/24/2006 Fedora 6 
06/01/2006 Ubuntu 6.06 
05/11/2006 SUSE 10.1 
03/20/2006 Fedora 5 
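
The whole sequence of sort, cut, and paste can be condensed into one sketch. It works on a three-line, tab-separated sample in a scratch directory, and the intermediate filenames are shortened for illustration:

```shell
# Work in a scratch directory with a three-line, tab-separated sample.
dir=$(mktemp -d)
printf '%s\t%s\t%s\n' SUSE 10.2 12/07/2006 Fedora 10 11/25/2008 Ubuntu 7.04 04/19/2007 \
  > "$dir/distros.txt"

# Sort newest first using the year, month, and day offsets within field 3.
sort -k 3.7nbr -k 3.1nbr -k 3.4nbr "$dir/distros.txt" > "$dir/by-date.txt"

# Split the sorted file into two column files, then paste them back
# together with the date column first. paste joins lines with a tab.
cut -f 1,2 "$dir/by-date.txt" > "$dir/versions.txt"
cut -f 3   "$dir/by-date.txt" > "$dir/dates.txt"
reordered=$(paste "$dir/dates.txt" "$dir/versions.txt")
printf '%s\n' "$reordered"

rm -rf "$dir"
```

The order of the arguments to paste determines the order of the columns in the result, which is all that is needed to move the date to the front.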


join—Join Lines of Two Files on a Common Field 


In some ways, join is like paste in that it adds columns to a file, but it uses a 
unique way to do it. A join is an operation usually associated with relational 
databases where data from multiple tables with a shared key field is combined 
to form a desired result. The join program performs the same operation. It 
joins data from multiple files based on a shared key field. 

To see how a join operation is used in a relational database, let’s imagine 
a small database consisting of two tables, each containing a single record. 
The first table, called CUSTOMERS, has three fields: a customer number 
(CUSTNUM), the customer’s first name (FNAME), and the customer’s last 
name (LNAME): 


CUSTNUM FNAME LNAME 


4681934 John Smith 


The second table is called ORDERS and contains four fields: an order 
number (ORDERNUM), the customer number (CUSTNUM), the quantity 
(QUAN), and the item ordered (ITEM). 


ORDERNUM CUSTNUM QUAN ITEM 


3014953305 4681934 1 Blue Widget 


Note that both tables share the field CUSTNUM. This is important 
because it allows a relationship between the tables. 

Performing a join operation would allow us to combine the fields in the 
two tables to achieve a useful result, such as preparing an invoice. Using the 
matching values in the CUSTNUM fields of both tables, a join operation 
could produce the following: 


FNAME LNAME QUAN ITEM 


John Smith 1 Blue Widget 


To demonstrate the join program, we’ll need to make a couple of files 
with a shared key. To do this, we will use our distros-by-date.txt file. From this 
file, we will construct two additional files. One contains the release dates 
(which will be our shared key for this demonstration) and the release names, 
as shown here: 


[me@linuxbox ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt 
[me@linuxbox ~]$ paste distros-dates.txt distros-names.txt > distros-key-names.txt 
[me@linuxbox ~]$ head distros-key-names.txt 
11/25/2008 Fedora 
10/30/2008 Ubuntu 
06/19/2008 SUSE 
05/13/2008 Fedora 
04/24/2008 Ubuntu 
11/08/2007 Fedora 
10/18/2007 Ubuntu 
10/04/2007 SUSE 
05/31/2007 Fedora 
04/19/2007 Ubuntu 


The second file contains the release dates and the version numbers, as 
shown here. 


[me@linuxbox ~]$ cut -f 2,2 distros-by-date.txt > distros-vernums.txt 
[me@linuxbox ~]$ paste distros-dates.txt distros-vernums.txt > distros-key-vernums.txt 
[me@linuxbox ~]$ head distros-key-vernums.txt 
11/25/2008 10 
10/30/2008 8.10 
06/19/2008 11.0 
05/13/2008 9 
04/24/2008 8.04 
11/08/2007 8 
10/18/2007 7.10 
10/04/2007 10.3 
05/31/2007 7 
04/19/2007 7.04 


We now have two files with a shared key (the “release date” field). It is 
important to point out that the files must be sorted on the key field for join 
to work properly. 


[me@linuxbox ~]$ join distros-key-names.txt distros-key-vernums.txt | head 
11/25/2008 Fedora 10 
10/30/2008 Ubuntu 8.10 
06/19/2008 SUSE 11.0 
05/13/2008 Fedora 9 
04/24/2008 Ubuntu 8.04 
11/08/2007 Fedora 8 
10/18/2007 Ubuntu 7.10 
10/04/2007 SUSE 10.3 
05/31/2007 Fedora 7 
04/19/2007 Ubuntu 7.04 


Note also that, by default, join uses whitespace as the input field delimiter 
and a single space as the output field delimiter. This behavior can be 
modified by specifying options. See the join man page for details. 
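

As a sketch of both points (sorting on the key and changing the delimiter), 
the following joins two comma-separated files. The filenames and records are 
hypothetical; the -t option, which sets the field delimiter for input and 
output, is a standard join option. 

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical comma-separated files keyed on a customer number.
printf '20,Smith\n10,Jones\n' > names.csv
printf '10,Widget\n20,Gadget\n' > orders.csv

# join expects both inputs to be sorted on the key field, so sort them first.
sort -t, -k1,1 names.csv > names.sorted
sort -t, -k1,1 orders.csv > orders.sorted

# -t, makes the comma both the input and the output field delimiter.
joined=$(join -t, names.sorted orders.sorted)
printf '%s\n' "$joined"
```

Sorting both files with the same sort command (and the same locale) keeps 
their ordering consistent with what join expects. 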


Comparing Text 


It is often useful to compare versions of text files. For system administrators 
and software developers, this is particularly important. A system administra- 
tor may, for example, need to compare an existing configuration file to a 
previous version to diagnose a system problem. Likewise, a programmer fre- 
quently needs to see what changes have been made to programs over time. 


comm—Compare Two Sorted Files Line by Line 


The comm program compares two text files and displays the lines that are 
unique to each one and the lines they have in common. To demonstrate, 
we will create two nearly identical text files using cat. 


[me@linuxbox ~]$ cat > file1.txt 
a 
b 
c 
d 
[me@linuxbox ~]$ cat > file2.txt 
b 
c 
d 
e 


Next, we will compare the two files using comm. 


[me@linuxbox ~]$ comm file1.txt file2.txt 
a 
                b 
                c 
                d 
        e 


As we can see, comm produces three columns of output. The first column 
contains lines unique to the first file argument, the second column contains 
the lines unique to the second file argument, and the third column contains 
the lines shared by both files. comm supports options in the form -n, 
where n is either 1, 2, or 3. When used, these options specify which columns 
to suppress. For example, if we wanted to output only the lines shared by 
both files, we would suppress the output of the first and second columns. 


[me@linuxbox ~]$ comm -12 file1.txt file2.txt 
b 
c 
d 
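

The other column combinations work the same way. As a brief sketch using the 
same two test files (re-created here in a scratch directory so the example is 
self-contained), we can list the lines unique to each file: 

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'a\nb\nc\nd\n' > file1.txt
printf 'b\nc\nd\ne\n' > file2.txt

only_first=$(comm -23 file1.txt file2.txt)    # suppress columns 2 and 3
only_second=$(comm -13 file1.txt file2.txt)   # suppress columns 1 and 3
echo "only in file1.txt: $only_first"
echo "only in file2.txt: $only_second"
```

Like join, comm expects its input files to be sorted. 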


diff—Compare Files Line by Line 


Like the comm program, diff is used to detect the differences between files. 
However, diff is a much more complex tool, supporting many output for- 
mats and the ability to process large collections of text files at once. diff 
is often used by software developers to examine changes between differ- 
ent versions of program source code and thus has the ability to recursively 
examine directories of source code, often referred to as source trees. One 
common use for diff is the creation of diff files or patches that are used by 
programs such as patch (which we'll discuss shortly) to convert one version 
of a file (or files) to another version. 

If we use diff to look at our previous example files: 


[me@linuxbox ~]$ diff file1.txt file2.txt 
1d0 
< a 
4a4 
> e 


we see its default style of output: a terse description of the differences 
between the two files. In the default format, each group of changes is pre- 
ceded by a change command in the form of range operation range to describe 
the positions and types of changes required to convert the first file to the 
second file, as outlined in Table 20-4. 


Table 20-4: diff Change Commands 

Change   Description 
r1ar2    Append the lines at the position r2 in the second file to the 
         position r1 in the first file. 
r1cr2    Change (replace) the lines at position r1 with the lines at the 
         position r2 in the second file. 
r1dr2    Delete the lines in the first file at position r1, which would have 
         appeared at range r2 in the second file. 


In this format, a range is a comma-separated list of the starting line and 
the ending line. While this format is the default (mostly for POSIX compli- 
ance and backward compatibility with traditional Unix versions of diff), it 
is not as widely used as other, optional formats. Two of the more popular 
formats are the context format and the unified format. 
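

Our test files produce only delete and append commands. As a small sketch of 
the change (c) command, the following uses two hypothetical three-line files, 
old.txt and new.txt, that differ only in their second line. 

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'a\nb\nc\n' > old.txt
printf 'a\nB\nc\n' > new.txt

# diff exits with status 1 when the files differ, so that status is
# deliberately not treated as an error here.
changes=$(diff old.txt new.txt) || true
printf '%s\n' "$changes"
```

The output starts with 2c2, meaning that line 2 of the first file is changed 
into line 2 of the second file. The old line follows, marked with <, then a 
--- separator, and then the new line, marked with >. 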

When viewed using the context format (the -c option), we will see this: 


[me@linuxbox ~]$ diff -c file1.txt file2.txt 
*** file1.txt 2008-12-23 06:40:13.000000000 -0500 
--- file2.txt 2008-12-23 06:40:34.000000000 -0500 
*************** 
*** 1,4 **** 
- a 
  b 
  c 
  d 
--- 1,4 ---- 
  b 
  c 
  d 
+ e 


The output begins with the names of the two files and their timestamps. 
The first file is marked with asterisks, and the second file is marked with 
dashes. Throughout the remainder of the listing, these markers will signify 
their respective files. Next, we see groups of changes, including the default 
number of surrounding context lines. In the first group, we see this: 


*** 1,4 **** 


which indicates lines 1 through 4 in the first file. Later we see this: 


--- 1,4 ---- 


which indicates lines 1 through 4 in the second file. Within a change group, 
lines begin with one of the four indicators described in Table 20-5. 


Table 20-5: diff Context Format Change Indicators 

Indicator   Meaning 
blank       A line shown for context. It does not indicate a difference 
            between the two files. 
-           A line deleted. This line will appear in the first file but not 
            in the second file. 
+           A line added. This line will appear in the second file but not 
            in the first file. 
!           A line changed. The two versions of the line will be displayed, 
            each in its respective section of the change group. 


The unified format is similar to the context format but is more concise. 
It is specified with the -u option. 


[me@linuxbox ~]$ diff -u file1.txt file2.txt 
--- file1.txt 2008-12-23 06:40:13.000000000 -0500 
+++ file2.txt 2008-12-23 06:40:34.000000000 -0500 
@@ -1,4 +1,4 @@ 
-a 
 b 
 c 
 d 
+e 


The most notable difference between the context and unified formats is 
the elimination of the duplicated lines of context, making the results of the 
unified format shorter than those of the context format. In our previous 
example, we see file timestamps like those of the context format, followed 
by the string @@ -1,4 +1,4 @@. This indicates the lines in the first file and the 
lines in the second file described in the change group. Following this are 
the lines themselves, with the default three lines of context. Each line starts 
with one of the three possible characters described in Table 20-6. 


Table 20-6: diff Unified Format Change Indicators 

Character   Meaning 
blank       This line is shared by both files. 
-           This line was removed from the first file. 
+           This line was added to the first file. 


patch—Apply a diff to an Original 


The patch program is used to apply changes to text files. It accepts output 
from diff and is generally used to convert older-version files into newer ver- 
sions. Let’s consider a famous example. The Linux kernel is developed by a 
large, loosely organized team of contributors who submit a constant stream 
of small changes to the source code. The Linux kernel consists of several 
million lines of code, while the changes that are made by one contributor 
at one time are quite small. It makes no sense for a contributor to send each 
developer an entire kernel source tree each time a small change is made. 
Instead, a diff file is submitted. The diff file contains the change from the 
previous version of the kernel to the new version with the contributor’s 
changes. The receiver then uses the patch program to apply the change to 
their own source tree. Using diff/patch offers two significant advantages. 


• The diff file is small, compared to the full size of the source tree. 

• The diff file concisely shows the change being made, allowing reviewers 
  of the patch to quickly evaluate it. 


Of course, diff/patch will work on any text file, not just source code. It 
would be equally applicable to configuration files or any other text. 

To prepare a diff file for use with patch, the GNU documentation 
suggests using diff as follows: 


diff -Naur old_file new_file > diff_file 


where old_file and new_file are either single files or directories containing 
files. The r option supports recursion of a directory tree. 

Once the diff file has been created, we can apply it to patch the old file 
into the new file. 


patch < diff_file 


We’ll demonstrate with our test file. 


[me@linuxbox ~]$ diff -Naur file1.txt file2.txt > patchfile.txt 
[me@linuxbox ~]$ patch < patchfile.txt 
patching file file1.txt 
[me@linuxbox ~]$ cat file1.txt 
b 
c 
d 
e 


In this example, we created a diff file named patchfile.txt and then used 
the patch program to apply the patch. Note that we did not have to specify 
a target file to patch, as the diff file (in unified format) already contains the 
filenames in the header. Once the patch is applied, we can see that file1.txt 
now matches file2.txt. 
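
The same technique extends to directories. The following is a minimal sketch, 
assuming two hypothetical directory trees named old and new that differ in one 
file. It uses the -p option of patch, which strips the given number of leading 
directory components from the filenames recorded in the diff. 

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical source trees that differ in one file.
mkdir old new
printf 'alpha\nbeta\n' > old/notes.txt
printf 'alpha\ngamma\n' > new/notes.txt

# diff returns status 1 when it finds differences; that is expected here.
diff -Naur old new > changes.diff || true

# Apply the patch from inside the old tree. -p1 strips the leading
# directory name (old/ or new/) from the filenames in the diff headers.
(cd old && patch -p1 < ../changes.diff)

patched=$(cat old/notes.txt)
printf '%s\n' "$patched"
```

After the patch is applied, old/notes.txt has the same contents as 
new/notes.txt. 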

patch has a large number of options, and there are additional utility 
programs that can be used to analyze and edit patches. 

Our experience with text editors has been largely interactive, meaning that 
we manually move a cursor around and then type our changes. However, 
there are non-interactive ways to edit text as well. It’s possible, for example, 
to apply a set of changes to multiple files with a single command. 


tr—Transliterate or Delete Characters 


The tr program is used to transliterate characters. We can think of this as a 
sort of character-based search-and-replace operation. Transliteration is the 
process of changing characters from one alphabet to another. For example, 
converting characters from lowercase to uppercase is transliteration. We 
can perform such a conversion with tr as follows: 


[me@linuxbox ~]$ echo "lowercase letters" | tr a-z A-Z 
LOWERCASE LETTERS 


As we can see, tr operates on standard input and outputs its results on 
standard output. tr accepts two arguments: a set of characters to convert 
from and a corresponding set of characters to convert to. Character sets 
may be expressed in one of three ways. 


• An enumerated list. For example, ABCDEFGHIJKLMNOPQRSTUVWXYZ. 

• A character range. For example, A-Z. Note that this method is sometimes 
  subject to the same issues as other commands because of the 
  locale collation order and thus should be used with caution. 

• POSIX character classes. For example, [:upper:]. 


In most cases, both character sets should be of equal length; however, 
it is possible for the first set to be larger than the second, particularly if we 
want to convert multiple characters to a single character. 


[me@linuxbox ~]$ echo "lowercase letters" | tr [:lower:] A 
AAAAAAAAA AAAAAAA 
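

POSIX character classes can be used for both sets. As a small sketch, here is 
the earlier lowercase-to-uppercase conversion written with classes instead of 
ranges. The classes are quoted so that the shell cannot mistake the square 
brackets for a filename pattern. 

```shell
# Quoting keeps the shell from treating [:lower:] as a pathname pattern.
upper=$(echo "lowercase letters" | tr '[:lower:]' '[:upper:]')
echo "$upper"
```

This form avoids the locale collation issues that can affect ranges such as a-z. 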


In addition to transliteration, tr allows characters to simply be deleted 
from the input stream. Earlier in this chapter, we discussed the problem of 
converting MS-DOS text files to Unix-style text. To perform this conversion, 
carriage return characters need to be removed from the end of each line. 
This can be performed with tr as follows: 


tr -d '\r' < dos_file > unix_file 


where dos_file is the file to be converted and unix_file is the result. This form 
of the command uses the escape sequence \r to represent the carriage return 
character. To see a complete list of the sequences and character classes tr 
supports, try the following: 


[me@linuxbox ~]$ tr --help 
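

To see the carriage return removal at work, we can build a small DOS-style 
file and compare sizes before and after. This is only a sketch; the filenames 
dos_file.txt and unix_file.txt are placeholders. 

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a two-line file with DOS-style CR LF line endings.
printf 'one\r\ntwo\r\n' > dos_file.txt

tr -d '\r' < dos_file.txt > unix_file.txt

# Count the bytes in each file; the two carriage returns should be gone.
dos_bytes=$(wc -c < dos_file.txt | tr -d ' ')
unix_bytes=$(wc -c < unix_file.txt | tr -d ' ')
echo "before: $dos_bytes bytes, after: $unix_bytes bytes"
```

The converted file is two bytes smaller, one for each carriage return removed. 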


tr can perform another trick, too. Using the -s option, tr can “squeeze” 
(delete) repeated instances of a character. 


[me@linuxbox ~]$ echo "aaabbbccc" | tr -s ab 
abccc 


Here we have a string containing repeated characters. By specifying 
the set ab to tr, we eliminate the repeated instances of the letters in the 
set, while leaving the character that is missing from the set (c) unchanged. 
Note that the repeating characters must be adjoining. If they are not, the 
squeezing will have no effect. 


[me@linuxbox ~]$ echo "abcabcabc" | tr -s ab 
abcabcabc 
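

A common practical use of squeezing is to collapse runs of spaces into a 
single space. A minimal sketch with a made-up input string: 

```shell
# Each run of adjacent spaces is squeezed down to one space.
squeezed=$(echo "too    many     spaces" | tr -s ' ')
echo "$squeezed"
```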




ROT13: THE NOT-SO-SECRET DECODER RING 


One amusing use of tr is to perform ROT13 encoding of text. ROT13 is a trivial 
type of encryption based on a simple substitution cipher. Calling ROT13 encryp- 
tion is being generous; text obfuscation is more accurate. It is used sometimes 
on text to obscure potentially offensive content. The method simply moves each 
character 13 places up the alphabet. Because this is halfway up the possible 
26 characters, performing the algorithm a second time on the text restores it to 
its original form. Use the following to perform this encoding with tr: 


echo "secret text" | tr a-zA-Z n-za-mN-ZA-M 
frperg grkg 


Performing the same procedure a second time results in the following 
translation: 


echo "frperg grkg" | tr a-zA-Z n-za-mN-ZA-M 
secret text 


A number of email programs and Usenet news readers support ROT13 
encoding. Wikipedia contains a good article on the subject at 
http://en.wikipedia.org/wiki/ROT13. 


sed—Stream Editor for Filtering and Transforming Text 


The name sed is short for stream editor. It performs text editing on a stream 
of text, either a set of specified files or standard input. sed is a powerful and 
somewhat complex program (there are entire books about it), so we will not 
cover it completely here. 

In general, the way sed works is that it is given either a single editing 
command (on the command line) or the name of a script file containing 
multiple commands, and it then performs these commands upon each line 
in the stream of text. Here is a simple example of sed in action: 


[me@linuxbox ~]$ echo "front" | sed 's/front/back/' 
back 


In this example, we produce a one-word stream of text using echo and 
pipe it into sed. sed, in turn, carries out the instruction s/front/back/ upon 
the text in the stream and produces the output back as a result. We can 
also recognize this command as resembling the “substitution” (search-and- 
replace) command in vi. 

Commands in sed begin with a single letter. In the previous example, the 
substitution command is represented by the letter s and is followed by the 
search-and-replace strings, separated by the slash character as a delimiter. 
The choice of the delimiter character is arbitrary. By convention, the slash 
character is often used, but sed will accept any character that immediately 
follows the command as the delimiter. We could perform the same command 
this way: 


[me@linuxbox ~]$ echo "front" | sed 's_front_back_' 
back 


By using the underscore character immediately after the command, it 
becomes the delimiter. The ability to set the delimiter can be used to make 
commands more readable, as we shall see. 
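
An alternate delimiter pays off when the text being replaced contains slashes, 
as pathnames do. The following sketch performs the same substitution twice on 
a made-up path, first with escaped slashes and then with underscores as the 
delimiter. 

```shell
# With slashes as the delimiter, every slash in the path must be escaped.
escaped=$(echo "/usr/local/bin" | sed 's/\/usr\/local/\/opt/')

# With an underscore as the delimiter, the path can be written as is.
readable=$(echo "/usr/local/bin" | sed 's_/usr/local_/opt_')

echo "$escaped"
echo "$readable"
```

Both commands produce /opt/bin, but the second is far easier to read. 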

Most commands in sed may be preceded by an address, which specifies 
which line(s) of the input stream will be edited. If the address is omitted, 
then the editing command is carried out on every line in the input stream. 
The simplest form of address is a line number. We can add one to our 
example. 


[me@linuxbox ~]$ echo "front" | sed '1s/front/back/' 
back 


Adding the address 1 to our command causes our substitution to be 
performed on the first line of our one-line input stream. If we specify 
another number, we see that the editing is not carried out since our input 
stream does not have a line 2. 


[me@linuxbox ~]$ echo "front" | sed '2s/front/back/' 
front 


Addresses may be expressed in many ways. Table 20-7 lists the most 
common. 


Table 20-7: sed Address Notation 

Address       Description 
n             A line number where n is a positive integer. 
$             The last line. 
/regexp/      Lines matching a POSIX basic regular expression. Note that the 
              regular expression is delimited by slash characters. Optionally, 
              the regular expression may be delimited by an alternate character, 
              by specifying the expression with \cregexpc, where c is the 
              alternate character. 
addr1,addr2   A range of lines from addr1 to addr2, inclusive. Addresses may be 
              any of the single address forms listed earlier. 
first~step    Match the line represented by the number first and then each 
              subsequent line at step intervals. For example, 1~2 refers to each 
              odd numbered line, and 5~5 refers to the fifth line and every 
              fifth line thereafter. 
addr1,+n      Match addr1 and the following n lines. 
addr!         Match all lines except addr, which may be any of the forms listed 
              earlier. 


We'll demonstrate different kinds of addresses using the distros.txt file 
from earlier in this chapter. First, here’s a range of line numbers: 


[me@linuxbox ~]$ sed -n '1,5p' distros.txt 
SUSE 10.2 12/07/2006 
Fedora 10 11/25/2008 
SUSE 11.0 06/19/2008 
Ubuntu 8.04 04/24/2008 
Fedora 8 11/08/2007 


In this example, we print a range of lines, starting with line 1 and con- 
tinuing to line 5. To do this, we use the p command, which simply causes 
a matched line to be printed. For this to be effective, however, we must 
include the option -n (the “no auto-print” option) to cause sed not to print 
every line by default. 

Next, we’ll try a regular expression. 


[me@linuxbox ~]$ sed -n '/SUSE/p' distros.txt 
SUSE 10.2 12/07/2006 
SUSE 11.0 06/19/2008 
SUSE 10.3 10/04/2007 
SUSE 10.1 05/11/2006 


By including the slash-delimited regular expression /SUSE/, we are able 
to isolate the lines containing it in much the same manner as grep. 

Finally, we'll try negation by adding an exclamation point (!) to the 
address. 


[me@linuxbox ~]$ sed -n '/SUSE/!p' distros.txt 
Fedora 10 11/25/2008 
Ubuntu 8.04 04/24/2008 
Fedora 8 11/08/2007 
Ubuntu 6.10 10/26/2006 
Fedora 7 05/31/2007 
Ubuntu 7.10 10/18/2007 
Ubuntu 7.04 04/19/2007 
Fedora 6 10/24/2006 
Fedora 9 05/13/2008 
Ubuntu 6.06 06/01/2006 
Ubuntu 8.10 10/30/2008 
Fedora 5 03/20/2006 


Here we see the expected result: all the lines in the file except the ones 
matched by the regular expression. 
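
The remaining address forms can be tried on any numbered input. The sketch 
below uses seq to generate the numbers 1 through 10, one per line, as sample 
data. Note that the first~step and addr1,+n forms are GNU sed extensions, so 
this example assumes GNU sed. 

```shell
# seq 1 10 prints the numbers 1 through 10, one per line.
last=$(seq 1 10 | sed -n '$p')         # the last line only
range=$(seq 1 10 | sed -n '/3/,+2p')   # first line matching 3, plus the next 2 lines
stepped=$(seq 1 10 | sed -n '1~3p')    # line 1 and every third line after it

printf 'last:\n%s\n' "$last"
printf 'range:\n%s\n' "$range"
printf 'stepped:\n%s\n' "$stepped"
```

The three commands select line 10, lines 3 through 5, and lines 1, 4, 7, and 10, 
respectively. 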

So far, we’ve looked at two of the sed editing commands, s and p. 
Table 20-8 provides a more complete list of the basic editing commands. 


Table 20-8: sed Basic Editing Commands 

Command                 Description 
=                       Output the current line number. 
a                       Append text after the current line. 
d                       Delete the current line. 
i                       Insert text in front of the current line. 
p                       Print the current line. By default, sed prints every 
                        line and only edits lines that match a specified 
                        address within the file. The default behavior can be 
                        overridden by specifying the -n option. 
q                       Exit sed without processing any more lines. If the -n 
                        option is not specified, output the current line. 
Q                       Exit sed without processing any more lines. 
s/regexp/replacement/   Substitute the contents of replacement wherever regexp 
                        is found. replacement may include the special 
                        character &, which is equivalent to the text matched 
                        by regexp. In addition, replacement may include the 
                        sequences \1 through \9, which are the contents of the 
                        corresponding subexpressions in regexp. For more about 
                        this, see the following discussion of back references. 
                        After the trailing slash following replacement, an 
                        optional flag may be specified to modify the s 
                        command's behavior. 
y/set1/set2             Perform transliteration by converting characters from 
                        set1 to the corresponding characters in set2. Note 
                        that unlike tr, sed requires that both sets be of the 
                        same length. 
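

A few of these commands can be sketched quickly on short, made-up input 
produced with printf and echo: 

```shell
deleted=$(printf 'one\ntwo\nthree\n' | sed '2d')   # delete line 2
quit=$(printf 'one\ntwo\nthree\n' | sed '2q')      # print through line 2, then exit
count=$(printf 'one\ntwo\n' | sed -n '$=')         # line number of the last line
wrapped=$(echo "hello" | sed 's/hello/[&]/')       # & stands for the matched text

printf '%s\n' "$deleted" "$quit" "$count" "$wrapped"
```

The d command leaves lines one and three, q stops after line two, = reports 
that the last line is line 2, and the substitution yields [hello]. 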


The s command is by far the most commonly used editing command. 
We will demonstrate just some of its power by performing an edit on our 
distros.txt file. We discussed earlier how the date field in distros.txt was not in 
a “computer-friendly” format. While the date is formatted MM/DD/YYYY, 
it would be better (for ease of sorting) if the format were YYYY-MM-DD. 
Performing this change on the file by hand would be both time-consuming 
and error-prone, but with sed, this change can be performed in one step. 


[me@linuxbox ~]$ sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/' distros.txt 
SUSE 10.2 2006-12-07 
Fedora 10 2008-11-25 
SUSE 11.0 2008-06-19 
Ubuntu 8.04 2008-04-24 
Fedora 8 2007-11-08 
SUSE 10.3 2007-10-04 
Ubuntu 6.10 2006-10-26 
Fedora 7 2007-05-31 
Ubuntu 7.10 2007-10-18 
Ubuntu 7.04 2007-04-19 
SUSE 10.1 2006-05-11 
Fedora 6 2006-10-24 
Fedora 9 2008-05-13 
Ubuntu 6.06 2006-06-01 
Ubuntu 8.10 2008-10-30 
Fedora 5 2006-03-20 


Wow! Now that is an ugly-looking command. But it works. In just one 
step, we have changed the date format in our file. It is also a perfect example 
of why regular expressions are sometimes jokingly referred to as a write-only 
medium. We can write them, but we sometimes cannot read them. Before 
we are tempted to run away in terror from this command, let’s look at how 
it was constructed. First, we know that the command will have this basic 
structure: 


sed 's/regexp/replacement/' distros.txt 


Our next step is to figure out a regular expression that will isolate the 
date. Because it is in MM/DD/YYYY format and appears at the end of the 
line, we can use an expression like this: 


[0-9]{2}/[0-9]{2}/[0-9]{4}$ 


This matches two digits, a slash, two digits, a slash, four digits, and the 
end of line. So that takes care of regexp, but what about replacement? To handle 
that, we must introduce a new regular expression feature that appears in 
some applications that use BRE. This feature is called back references and works 
like this: if the sequence \n appears in replacement where n is a number from 
1 to 9, the sequence will refer to the corresponding subexpression in the pre- 
ceding regular expression. To create the subexpressions, we simply enclose 
them in parentheses like so: 


([0-9]{2})/([0-9]{2})/([0-9]{4})$ 


We now have three subexpressions. The first contains the month, the 
second contains the day of the month, and the third contains the year. Now 
we can construct replacement as follows: 


\3-\1-\2 


This gives us the year, a dash, the month, a dash, and the day. 
Now, our command looks like this: 


sed 's/([0-9]{2})/([0-9]{2})/([0-9]{4})$/\3-\1-\2/' distros.txt 


We have two remaining problems. The first is that the extra slashes 
in our regular expression will confuse sed when it tries to interpret the s 


command. The second is that because sed, by default, accepts only basic 
regular expressions, several of the characters in our regular expression will 
be taken as literals, rather than as metacharacters. We can solve both these 
problems with a liberal application of backslashes to escape the offending 
characters. 


sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/' distros.txt 


And there you have it! 
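
Many of the backslashes can be avoided. As a sketch, assuming a sed that 
supports the -E option for extended regular expressions (GNU sed and the BSD 
sed both do), we can combine -E with an alternate delimiter. The input line 
here is a single sample record rather than the whole distros.txt file. 

```shell
# -E enables extended regular expressions, so ( ) and { } need no
# backslashes; using # as the delimiter means the slashes in the date
# need none either.
converted=$(echo "SUSE 10.2 12/07/2006" |
    sed -E 's#([0-9]{2})/([0-9]{2})/([0-9]{4})$#\3-\1-\2#')
echo "$converted"
```

The regular expression is now nearly identical to the readable version we 
worked out before adding the escapes. 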

Another feature of the s command is the use of optional flags that may 
follow the replacement string. The most important of these is the g flag, 
which instructs sed to apply the search-and-replace globally to a line, not 
just to the first instance, which is the default. Here is an example: 


[me@linuxbox ~]$ echo "aaabbbccc" | sed 's/b/B/' 
aaaBbbccc 


We see that the replacement was performed, but only to the first instance 
of the letter b, while the remaining instances were left unchanged. By adding 
the g flag, we are able to change all the instances. 


[me@linuxbox ~]$ echo "aaabbbccc" | sed 's/b/B/g' 
aaaBBBccc 
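

A number may also be given as a flag, in which case only that occurrence of 
the match on each line is replaced. A brief sketch using the same test string: 

```shell
# The flag 2 replaces only the second occurrence of b on the line.
second=$(echo "aaabbbccc" | sed 's/b/B/2')
echo "$second"
```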


So far, we have only given sed single commands via the command line. 
It is also possible to construct more complex commands in a script file using 
the -f option. To demonstrate, we will use sed with our distros.txt file to build 
a report. Our report will feature a title at the top, our modified dates, and 
all the distribution names converted to uppercase. To do this, we will need 
to write a script, so we’ll fire up our text editor and enter the following: 


# sed script to produce Linux distributions report 

1 i\ 
\ 
Linux Distributions Report\ 

s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/ 
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/ 


We will save our sed script as distros.sed and run it like this: 


[me@linuxbox ~]$ sed -f distros.sed distros.txt 

Linux Distributions Report 

SUSE 10.2 2006-12-07 
FEDORA 10 2008-11-25 
SUSE 11.0 2008-06-19 
UBUNTU 8.04 2008-04-24 
FEDORA 8 2007-11-08 
SUSE 10.3 2007-10-04 
UBUNTU 6.10 2006-10-26 
FEDORA 7 2007-05-31 
UBUNTU 7.10 2007-10-18 
UBUNTU 7.04 2007-04-19 
SUSE 10.1 2006-05-11 
FEDORA 6 2006-10-24 
FEDORA 9 2008-05-13 
UBUNTU 6.06 2006-06-01 
UBUNTU 8.10 2008-10-30 
FEDORA 5 2006-03-20 


As we can see, our script produces the desired results, but how does it do 
it? Let’s take another look at our script. We’ll use cat to number the lines. 


[me@linuxbox ~]$ cat -n distros.sed 
     1  # sed script to produce Linux distributions report 
     2 
     3  1 i\ 
     4  \ 
     5  Linux Distributions Report\ 
     6 
     7  s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/ 
     8  y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/ 


Line 1 of our script is a comment. Like many configuration files and pro- 
gramming languages on Linux systems, comments begin with the # character 
and are followed by human-readable text. Comments can be placed anywhere 
in the script (though not within commands themselves) and are helpful to 
any humans who might need to identify and/or maintain the script. 

Line 2 is a blank line. Like comments, blank lines may be added to 
improve readability. 

Many sed commands support line addresses. These are used to specify 
which lines of the input are to be acted upon. Line addresses may be 
expressed as single line numbers, line number ranges, and the special 
line number $, which indicates the last line of input. 

Lines 3, 4, 5, and 6 contain text to be inserted at the address 1, the first 
line of the input. The i command is followed by the sequence of a back- 
slash and then a carriage return to produce an escaped carriage return, or 
what is called a line-continuation character. This sequence, which can be used 
in many circumstances including shell scripts, allows a carriage return to 
be embedded in a stream of text without signaling the interpreter (in this 
case sed) that the end of the line has been reached. The i and the a (which 
appends text, rather than inserting it) and c (which replaces text) com- 
mands allow multiple lines of text as long as each line, except the last, ends 
with a line-continuation character. The sixth line of our script is actually 
the end of our inserted text and ends with a plain carriage return rather 
than a line-continuation character, signaling the end of the i command. 


A line-continuation character is formed by a backslash followed immediately by a 
carriage return. No intermediary spaces are permitted. 


Line 7 is our search-and-replace command. Since it is not preceded by 
an address, each line in the input stream is subject to its action. 

Line 8 performs transliteration of the lowercase letters into uppercase 
letters. Note that unlike tr, the y command in sed does not support character 
ranges (for example, [a-z]), nor does it support POSIX character classes. Again, because the y 
command is not preceded by an address, it applies to every line in the input 
stream. 
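

The same structure can be tried without a text editor by writing the script 
from the shell. The following is a minimal sketch with a hypothetical script 
name (fruit.sed) and made-up input. It uses the i command to add a title, the 
a command with the $ address to add a footer after the last line, and the y 
command to convert the input to uppercase. 

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the script with a here-document. Quoting 'EOF' keeps the shell
# from interpreting the backslashes and the dollar sign.
cat > fruit.sed <<'EOF'
# Insert a title before line 1 and a footer after the last line
1 i\
Fruit Report
$ a\
End of report
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
EOF

report=$(printf 'apple\nbanana\n' | sed -f fruit.sed)
printf '%s\n' "$report"
```

As with the title in our distros report, the inserted and appended text is 
not affected by the y command, because y acts only on the lines read from the 
input stream. 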


PEOPLE WHO LIKE SED ALSO LIKE... 


sed is a capable program, able to perform fairly complex editing tasks to 
streams of text. It is most often used for simple, one-line tasks rather than long 
scripts. Many users prefer other tools for larger tasks. The most popular of these 
are awk and perl. These go beyond mere tools like the programs covered here 
and extend into the realm of complete programming languages. perl, in par- 
ticular, is often used instead of shell scripts for many system management and 
administration tasks, as well as being a popular medium for web development. 


awk is a little more specialized. Its specific strength is its ability to manipulate 
tabular data. It resembles sed in that awk programs normally process text files 
line by line, using a scheme similar to the sed concept of an address followed 
by an action. While both awk and perl are outside the scope of this book, they 
are good skills for the Linux command line user to learn. 


aspell—Interactive Spellchecker 


The last tool we will look at is aspell, an interactive spelling checker. The 
aspell program is the successor to an earlier program named ispell and can 
be used, for the most part, as a drop-in replacement. While the aspell program 
is mostly used by other programs that require spellchecking capability, 
it can also be used effectively as a stand-alone tool from the command line. 
It has the ability to intelligently check various types of text files, including 
HTML documents, C or C++ programs, email messages, and other kinds of 
specialized texts. 

To spellcheck a text file containing simple prose, it could be used like this: 


aspell check textfile 


where textfile is the name of the file to check. As a practical example, let’s 
create a simple text file named foo.txt containing some deliberate spelling 
errors. 


[me@linuxbox ~]$ cat > foo.txt 
The quick brown fox jimped over the laxy dog. 


Next we’ll check the file using aspell. 


[me@linuxbox ~]$ aspell check foo.txt 


As aspell is interactive in the check mode, we will see a screen like this. 


The quick brown fox jimped over the laxy dog. 

1) jumped                6) wimped 
2) gimped                7) camped 
3) comped                8) humped 
4) limped                9) impede 
5) pimped                0) umped 
i) Ignore                I) Ignore all 
r) Replace               R) Replace all 
a) Add                   l) Add Lower 
b) Abort                 x) Exit 

? 


At the top of the display, we see our text with a suspiciously spelled 
word highlighted. In the middle, we see 10 spelling suggestions numbered 0 
through 9, followed by a list of other possible actions. Finally, at the bottom, 
we see a prompt ready to accept our choice. 

If we press the 1 key, aspell replaces the offending word with the word 
jumped and moves on to the next misspelled word, which is laxy. If we select 
the replacement lazy, aspell replaces it and terminates. Once aspell has 
finished, we can examine our file and see that the misspellings have been 
corrected. 


[me@linuxbox ~]$ cat foo.txt 
The quick brown fox jumped over the lazy dog. 


Unless told otherwise via the command line option --dont-backup, aspell 
creates a backup file containing the original text by appending the exten- 
sion .bak to the filename. 

Showing off our sed editing prowess, we’ll put our spelling mistakes 
back in so we can reuse our file. 


[me@linuxbox ~]$ sed -i 's/lazy/laxy/; s/jumped/jimped/' foo.txt 


The sed option -i tells sed to edit the file “in-place,” meaning that 
rather than sending the edited output to standard output, it will rewrite the 
file with the changes applied. We also see the ability to place more than one 
editing command on the line by separating them with a semicolon. 
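
As a minimal sketch of in-place editing, the following creates a scratch file, applies two substitutions with one sed invocation, and keeps a copy of the original. The scratch directory and the .orig suffix are choices made for this illustration only. Attaching a suffix to -i (as in -i.orig) asks sed to save the unedited file under that name, which GNU sed supports.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
printf 'The quick brown fox jumped over the lazy dog.\n' > "$workdir/foo.txt"

# Two editing commands separated by a semicolon. The -i.orig option edits
# the file in place and keeps the original as foo.txt.orig.
sed -i.orig 's/lazy/laxy/; s/jumped/jimped/' "$workdir/foo.txt"

cat "$workdir/foo.txt"        # the edited text
cat "$workdir/foo.txt.orig"   # the untouched original
```

Keeping a backup this way is a cheap safety net while experimenting, since an in-place edit otherwise leaves no copy of the original text.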


Next, we’ll look at how aspell can handle different kinds of text files. 
Using a text editor such as vim (the adventurous may want to try sed), we will 
add some HTML markup to our file. 


<html> 
<head> 
<title>Mispelled HTML file</title> 
</head> 
<body> 
<p>The quick brown fox jimped over the laxy dog.</p> 
</body> 
</html> 


Now, if we try to spellcheck our modified file, we run into a problem. If 
we do it this way: 


[me@linuxbox ~]$ aspell check foo.txt 


we'll get this: 


<html>
<head>
<title>Mispelled HTML file</title>
</head>
<body>
<p>The quick brown fox jimped over the laxy dog.</p>
</body>
</html>

1) HTML                                 4) Hamel
2) ht ml                                5) Hamil
3) ht-ml                                6) hotel
i) Ignore                               I) Ignore all
r) Replace                              R) Replace all
a) Add                                  l) Add Lower
b) Abort                                x) Exit

?


aspell will see the contents of the HTML tags as misspelled. This 
problem can be overcome by including the -H (HTML) checking-mode 
option, like this: 


[me@linuxbox ~]$ aspell -H check foo.txt 


which will result in this: 


<html>
<head>
<title>Mispelled HTML file</title>
</head>
<body>
<p>The quick brown fox jimped over the laxy dog.</p>
</body>
</html>

1) Mi spelled                           6) Misapplied
2) Mi-spelled                           7) Miscalled
3) Misspelled                           8) Respelled
4) Dispelled                            9) Misspell
5) Spelled                              0) Misled
i) Ignore                               I) Ignore all
r) Replace                              R) Replace all
a) Add                                  l) Add Lower
b) Abort                                x) Exit

?


The HTML is ignored, and only the non-markup portions of the file 
are checked. In this mode, the contents of HTML tags are ignored and not 
checked for spelling. However, the contents of ALT tags, which benefit from 
checking, are checked in this mode. 


By default, aspell will ignore URLs and email addresses in text. This behavior can 
be overridden with command line options. It is also possible to specify which markup 
tags are checked and skipped. See the aspell man page for details. 


Summing Up 


In this chapter, we looked at a few of the many command line tools
that operate on text. In the next chapter, we will look at several more.
Admittedly, it may not seem immediately obvious how or why you might 
use some of these tools on a day-to-day basis, though we have tried to 
show some practical examples of their use. We will find in later chapters 
that these tools form the basis of a tool set that is used to solve a host of 
practical problems. This will be particularly true when we get into shell 
scripting, where these tools will really show their worth. 


Extra Credit 


There are a few more interesting text-manipulation commands worth 
investigating. Among these are split (split files into pieces), csplit (split 
files into pieces based on context), and sdiff (side-by-side merge of file 
differences). 


FORMATTING OUTPUT


In this chapter, we continue our look at text-related tools, focusing on
programs that are used to format text output, rather than changing the text
itself. These tools are often used to prepare text for eventual printing, a
subject that we will cover in the next chapter. We will cover the following
programs in this chapter:


nl      Number lines
fold    Wrap each line to a specified length
fmt     A simple text formatter
pr      Prepare text for printing
printf  Format and print data
groff   A document formatting system


Simple Formatting Tools 


We’ll look at some of the simple formatting tools first. These are mostly 
single-purpose programs and a bit unsophisticated in what they do, but 
they can be used for small tasks and as parts of pipelines and scripts. 


nl—Number Lines 


The nl program is a rather arcane tool used to perform a simple task. It 
numbers lines. In its simplest use, it resembles cat -n. 


[me@linuxbox ~]$ nl distros.txt | head 


1 SUSE 10.2 12/07/2006 
2 Fedora 10 11/25/2008 
3 SUSE 11.0 06/19/2008 
4 Ubuntu 8.04 04/24/2008 
5 Fedora 8 11/08/2007 
6 SUSE 10.3 10/04/2007 
7 Ubuntu 6.10 10/26/2006 
8 Fedora 7 05/31/2007 
9 Ubuntu 7.10 10/18/2007 
10 Ubuntu 7.04 04/19/2007 


Like cat, nl can accept either multiple files as command line arguments 
or standard input. However, nl has a number of options and supports a 
primitive form of markup to allow more complex kinds of numbering. 

nl supports a concept called logical pages when numbering. This allows nl 
to reset (start over) the numerical sequence when numbering. Using options, 
it is possible to set the starting number to a specific value and, to a limited 
extent, its format. A logical page is further broken down into a header, body, 
and footer. Within each of these sections, line numbering may be reset and/ 
or be assigned a different style. If nl is given multiple files, it treats them as 
a single stream of text. Sections in the text stream are indicated by the pres- 
ence of some rather odd-looking markup added to the text, as described in 
Table 21-1. 


Table 21-1: nl Markup

Markup    Meaning
\:\:\:    Start of logical page header
\:\:      Start of logical page body
\:        Start of logical page footer


Each of the markup elements listed in Table 21-1 must appear alone on 
its own line. After processing a markup element, nl deletes it from the text 
stream. 
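
To make the markup concrete, here is a small sketch that feeds nl a stream containing all three section markers. The sample lines (Inventory, apples, pears, End) are invented for the illustration. With default options we would expect only the two body lines to receive numbers.

```shell
# Feed nl a small stream: a header, two body lines, and a footer.
# The single quotes keep the backslashes in the markup literal, and
# printf's %s prints each argument unchanged on its own line.
numbered=$(printf '%s\n' '\:\:\:' 'Inventory' '\:\:' 'apples' 'pears' '\:' 'End' | nl)
printf '%s\n' "$numbered"
```

The marker lines themselves do not appear in the output, and the header and footer text passes through unnumbered because the default style for both sections is n (none).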

Table 21-2 lists the common options for nl. 

Table 21-2: Common nl Options

Option       Meaning
-b style     Set body numbering to style, where style is one of the following:
               a = Number all lines.
               t = Number only non-blank lines. This is the default.
               n = None.
               pregexp = Number only lines matching basic regular expression regexp.
-f style     Set footer numbering to style. The default is n (none).
-h style     Set header numbering to style. The default is n (none).
-i number    Set page numbering increment to number. The default is 1.
-n format    Set numbering format to format, where format is one of the following:
               ln = Left justified, without leading zeros.
               rn = Right justified, without leading zeros. This is the default.
               rz = Right justified, with leading zeros.
-p           Do not reset page numbering at the beginning of each logical page.
-s string    Add string to the end of each line number to create a separator. The
             default is a single tab character.
-v number    Set the first line number of each logical page to number. The default is 1.
-w width     Set the width of the line number field to width. The default is 6.


Admittedly, we probably won’t be numbering lines that often, but we 
can use nl to look at how we can combine multiple tools to perform more 
complex tasks. We will build on our work in the previous chapter to produce
a Linux distributions report. Since we will be using nl, it will be useful
to include its header/body/footer markup. To do this, we will add it to the 
sed script from the previous chapter. Using our text editor, we will change 
the script as follows and save it as distros-nl.sed: 


# sed script to produce Linux distributions report

1 i\
\\:\\:\\:\
\
Linux Distributions Report\
\
Name       Ver.    Released\
----       ----    --------\
\\:\\:
s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/
$ a\
\\:\
\
End Of Report


The script now inserts the nl logical page markup and adds a footer at 
the end of the report. Note that we had to double up the backslashes in our 
markup because they are normally interpreted as an escape character by sed. 

Next, we’ll produce our enhanced report by combining sort, sed, and nl.


[me@linuxbox ~]$ sort -k 1,1 -k 2n distros.txt | sed -f distros-nl.sed | nl 


Linux Distributions Report 


Name Ver. Released 
1 Fedora 5 2006-03-20 
2 Fedora 6 2006-10-24 
3 Fedora 7 2007-05-31 
4 Fedora 8 2007-11-08 
5 Fedora 9 2008-05-13 
6 Fedora 10 2008-11-25 
7 SUSE 10.1 2006-05-11 
8 SUSE 10.2 2006-12-07 
9 SUSE 10.3 2007-10-04 
10 SUSE 11.0 2008-06-19 
11 Ubuntu 6.06 2006-06-01 
12 Ubuntu 6.10 2006-10-26 
13 Ubuntu 7.04 2007-04-19 
14 Ubuntu 7.10 2007-10-18 
15 Ubuntu 8.04 2008-04-24 
16 Ubuntu 8.10 2008-10-30 


End Of Report 


Our report is the result of our pipeline of commands. First, we sort the 
list by distribution name and version (fields 1 and 2), and then we process 
the results with sed, adding the report header (including the logical page 
markup for nl) and footer. Finally, we process the result with nl, which, by 
default, only numbers the lines of the text stream that belong to the body 
section of the logical page. 

We can repeat the command and experiment with different options for 
nl. Some interesting ones are the following: 


nl -n rz


and the following:


nl -w 3 -s ' '
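
As a quick sketch of how these options combine, the following numbers three lines with zero-padded numbers in a three-character field, using a single space as the separator instead of the default tab. The three input words are sample data chosen for the illustration.

```shell
# -n rz : right justified with leading zeros
# -w 3  : number field is three characters wide
# -s ' ': a single space follows each number
numbered=$(printf '%s\n' Fedora SUSE Ubuntu | nl -n rz -w 3 -s ' ')
printf '%s\n' "$numbered"
# 001 Fedora
# 002 SUSE
# 003 Ubuntu
```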


fold—Wrap Each Line to a Specified Length 


Folding is the process of breaking lines of text at a specified width. Like our 
other commands, fold accepts either one or more text files or standard input. 
If we send fold a simple stream of text, we can see how it works. 


[me@linuxbox ~]$ echo "The quick brown fox jumped over the lazy dog." | fold -w 12
The quick br
own fox jump
ed over the
lazy dog.


Here we see fold in action. The text sent by the echo command is broken 
into segments specified by the -w option. In this example, we specify a line 
width of 12 characters. If no width is specified, the default is 80 characters. 
Notice how the lines are broken regardless of word boundaries. The addi- 
tion of the -s option will cause fold to break the line at the last available 
space before the line width is reached. 


[me@linuxbox ~]$ echo "The quick brown fox jumped over the lazy dog." | fold -w 12 -s
The quick
brown fox
jumped over
the lazy
dog.
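
Breaking lines without regard to word boundaries is sometimes exactly what we want. As a small sketch, the following splits a long string that contains no spaces into fixed-width chunks. The string stored in digest is made-up sample data.

```shell
# A long run of characters with no spaces, such as a checksum.
digest="0123456789abcdef0123"

# Without -s, fold cuts the line at exactly eight characters each time.
chunks=$(printf '%s\n' "$digest" | fold -w 8)
printf '%s\n' "$chunks"
# 01234567
# 89abcdef
# 0123
```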


fmt—A Simple Text Formatter 


The fmt program also folds text, plus a lot more. It accepts either files or 
standard input and performs paragraph formatting on the text stream. 
Basically, it fills and joins lines in text while preserving blank lines and 
indentation. 

To demonstrate, we’ll need some text. Let’s lift some from the fmt 
info page. 


   `fmt' reads from the specified FILE arguments (or standard input if none
are given), and writes to standard output.

   By default, blank lines, spaces between words, and indentation are
preserved in the output; successive input lines with different
indentation are not joined; tabs are expanded on input and introduced on
output.

   `fmt' prefers breaking lines at the end of a sentence, and tries to avoid
line breaks after the first word of a sentence or before the last word of a
sentence. A "sentence break" is defined as either the end of a paragraph
or a word ending in any of `.?!', followed by two spaces or end of line,
ignoring any intervening parentheses or quotes. Like TeX, `fmt' reads entire
"paragraphs" before choosing line breaks; the algorithm is a variant of that
given by Donald E. Knuth and Michael F. Plass in "Breaking Paragraphs Into
Lines", `Software--Practice & Experience' 11, 11 (November 1981), 1119-1184.


We'll copy this text into our text editor and save the file as fmt-info.txt.
Now, let’s say we wanted to reformat this text to fit a 50-character-wide 
column. We could do this by processing the file with fmt and the -w option. 


[me@linuxbox ~]$ fmt -w 50 fmt-info.txt | head
   `fmt' reads from the specified FILE arguments
   (or standard input if
none are given), and writes to standard output.

   By default, blank lines, spaces between words,
   and indentation are
preserved in the output; successive input lines
with different indentation are not joined; tabs
are expanded on input and introduced on output.


Well, that’s an awkward result. Perhaps we should actually read this text 
since it explains what’s going on. 


By default, blank lines, spaces between words, and indentation
are preserved in the output; successive input lines with different
indentation are not joined; tabs are expanded on input and
introduced on output.


So, fmt is preserving the indentation of the first line. Fortunately, fmt 
provides an option to correct this. 


[me@linuxbox ~]$ fmt -cw 50 fmt-info.txt
   `fmt' reads from the specified FILE arguments
(or standard input if none are given), and writes
to standard output.

   By default, blank lines, spaces between words,
and indentation are preserved in the output;
successive input lines with different indentation
are not joined; tabs are expanded on input and
introduced on output.

   `fmt' prefers breaking lines at the end of a
sentence, and tries to avoid line breaks after
the first word of a sentence or before the
last word of a sentence. A "sentence break"
is defined as either the end of a paragraph
or a word ending in any of `.?!', followed
by two spaces or end of line, ignoring any
intervening parentheses or quotes. Like TeX,
`fmt' reads entire "paragraphs" before choosing
line breaks; the algorithm is a variant of
that given by Donald E. Knuth and Michael F.
Plass in "Breaking Paragraphs Into Lines",
`Software--Practice & Experience' 11, 11
(November 1981), 1119-1184.


That’s much better. By adding the -c option, we now have the desired
result.

fmt has some interesting options, as described in Table 21-3.

Table 21-3: fmt Options
Option Description


-c Operate in crown margin mode. This preserves the indentation of the
first two lines of a paragraph. Subsequent lines are aligned with the
indentation of the second line.


-p string Format only those lines beginning with the prefix string. After formatting, 
the contents of string are prefixed to each reformatted line. This option 
can be used to format text in source code comments. For example, any 
programming language or configuration file that uses a # character to 
delineate a comment could be formatted by specifying -p '# ' so that 
only the comments will be formatted. See the example that follows. 


-s Split-only mode. In this mode, lines will only be split to fit the specified 
column width. Short lines will not be joined to fill lines. This mode is use- 
ful when formatting text such as code where joining is not desired. 


-u Perform uniform spacing. This will apply traditional “typewriter-style” 
formatting to the text. This means a single space between words and 
two spaces between sentences. This mode is useful for removing justifi- 
cation, that is, text that has been padded with spaces to force alignment 
on both the left and right margins. 


-w width Format text to fit within a column width characters wide. The default is 
75 characters. Note: fmt actually formats lines slightly shorter than the 
specified width to allow for line balancing. 
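
Two of these options are easy to try on short pieces of text. The following sketch, using made-up sample input, contrasts uniform spacing (-u) with the default spacing, and split-only mode (-s) with the default joining behavior. The variable names are chosen only for this illustration.

```shell
# -u: collapse irregular runs of spaces to uniform spacing.
uniform=$(printf 'The   quick  brown    fox.\n' | fmt -u)

# By default, fmt joins short lines into one filled line...
joined=$(printf 'short\nlines\n' | fmt -w 40)

# ...while -s only splits long lines and never joins short ones.
split_only=$(printf 'short\nlines\n' | fmt -s -w 40)

printf '%s\n' "$uniform" "$joined" "$split_only"
```

Here uniform holds the sentence with single spaces between words, joined holds both words on one line, and split_only keeps the two input lines separate.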


The -p option is particularly interesting. With it, we can format selected 
portions of a file, provided that the lines to be formatted all begin with the 
same sequence of characters. Many programming languages use the pound 
sign (#) to indicate the beginning of a comment and thus can be formatted 
using this option. Let’s create a file that simulates a program that uses 
comments. 


[me@linuxbox ~]$ cat > fmt-code.txt 
# This file contains code with comments. 


# This line is a comment. 
# Followed by another comment line. 
# And another. 


This, on the other hand, is a line of code. 
And another line of code. 
And another. 


Our sample file contains comments that begin with the string # (a # 
followed by a space) and lines of “code” that do not. Now, using fmt, we can 
format the comments and leave the code untouched. 


[me@linuxbox ~]$ fmt -w 50 -p '# ' fmt-code.txt 
# This file contains code with comments.

# This line is a comment. Followed by another
# comment line. And another.


This, on the other hand, is a line of code. 
And another line of code. 
And another. 


Notice that the adjoining comment lines are joined, while the blank 
lines and the lines that do not begin with the specified prefix are preserved. 


pr—Format Text for Printing 


The pr program is used to paginate text. When printing text, it is often 
desirable to separate the pages of output with several lines of whitespace 
to provide a top margin and a bottom margin for each page. Further, this 
whitespace can be used to insert a header and footer on each page. 

We’ll demonstrate pr by formatting our distros.txt file into a series of 
short pages (only the first two pages are shown). 


[me@linuxbox ~]$ pr -l 15 -w 65 distros.txt


2016-12-11 18:27                distros.txt                Page 1


SUSE 10.2 12/07/2006
Fedora 10 11/25/2008
SUSE 11.0 06/19/2008
Ubuntu 8.04 04/24/2008
Fedora 8 11/08/2007





2016-12-11 18:27                distros.txt                Page 2


SUSE 10.3 10/04/2007
Ubuntu 6.10 10/26/2006
Fedora 7 05/31/2007
Ubuntu 7.10 10/18/2007
Ubuntu 7.04 04/19/2007


In this example, we employ the -l option (for page length) and the -w
option (page width) to define a “page” that is 65 columns wide and 15 lines
long. pr paginates the contents of the distros.txt file, separates each page
with several lines of whitespace, and creates a default header containing
the file modification time, filename, and page number. The pr program
provides many options to control page layout, which we’ll see in Chapter 22.
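
As a small sketch of pagination, the following sends seven lines of sample text through pr and counts the resulting pages. It uses the standard -h option to replace the filename in the header with a title of our own. The title "Sample Report" and the input words are invented for the illustration.

```shell
# Seven lines of input. With a 15-line page, pr sets aside lines for the
# header and trailer, leaving five lines of text per page.
paged=$(printf '%s\n' one two three four five six seven | pr -l 15 -h "Sample Report")
printf '%s\n' "$paged"

# Each page carries one header line, so counting them counts the pages.
page_count=$(printf '%s\n' "$paged" | grep -c 'Sample Report')
echo "pages: $page_count"
```

Seven lines at five lines per page spill onto a second page, so the header appears twice.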


printf—Format and Print Data 


Unlike the other commands in this chapter, the printf command is not used 
for pipelines (it does not accept standard input) nor does it find frequent 
application directly on the command line (it’s mostly used in scripts). So why 
is it important? Because it is so widely used. 

printf (from the phrase print formatted) was originally developed for the 
C programming language and has been implemented in many programming 
languages including the shell. In fact, in bash, printf is a builtin. 

printf works like this: 


printf "format" arguments 


The command is given a string containing a format description, which 
is then applied to a list of arguments. The formatted result is sent to standard 
output. Here is a trivial example: 


[me@linuxbox ~]$ printf "I formatted the string: %s\n" foo 
I formatted the string: foo 


The format string may contain literal text (like “I formatted the string:”), 
escape sequences (such as \n, a newline character), and sequences beginning 
with the % character, which are called conversion specifications. In the preceding
example, the conversion specification %s is used to format the string “foo”
and place it in the command's output. Here it is again:


[me@linuxbox ~]$ printf "I formatted '%s' as a string.\n" foo 
I formatted 'foo' as a string. 


As we can see, the %s conversion specification is replaced by the string 
“foo” in the command’s output. The s conversion is used to format string 
data. There are other specifiers for other kinds of data. Table 21-4 lists the 
commonly used data types. 


Table 21-4: Common printf Data Type Specifiers

Specifier    Description
d            Format a number as a signed decimal integer.
f            Format and output a floating-point number.
o            Format an integer as an octal number.
s            Format a string.
x            Format an integer as a hexadecimal number using lowercase a to f where
             needed.
X            Same as x but use uppercase letters.
%            Print a literal % symbol (i.e., specify %%).

We'll demonstrate the effect of each of the conversion specifiers on the
string 380.


[me@linuxbox ~]$ printf "%d, %f, %o, %s, %x, %X\n" 380 380 380 380 380 380
380, 380.000000, 574, 380, 17c, 17C 


Because we specified six conversion specifiers, we must also supply six 
arguments for printf to process. The six results show the effect of each 
specifier. 
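
A related behavior is worth knowing when matching arguments to specifiers: if printf is given more arguments than the format string has specifiers, the shell applies the format again until every argument has been used. A minimal sketch, with made-up names and values:

```shell
# Two specifiers and six arguments: the format is applied three times,
# consuming one name and one number on each pass.
pairs=$(printf '%s=%d\n' width 65 length 15 copies 2)
printf '%s\n' "$pairs"
# width=65
# length=15
# copies=2
```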

Several optional components may be added to the conversion specifier 
to adjust its output. A complete conversion specification may consist of the 
following: 


%[flags][width][.precision]conversion_specification


Multiple optional components, when used, must appear in the order 
specified earlier to be properly interpreted. Table 21-5 describes each. 


Table 21-5: printf Conversion Specification Components

Component     Description
flags         There are five different flags:
                #: Use the alternate format for output. This varies by data type.
                For o (octal number) conversion, the output is prefixed with 0.
                For x and X (hexadecimal number) conversions, the output is
                prefixed with 0x or 0X, respectively.
                0 (zero): Pad the output with zeros. This means that the field
                will be filled with leading zeros, as in 000380.
                - (dash): Left-align the output. By default, printf right-aligns
                output.
                ' ' (space): Produce a leading space for positive numbers.
                + (plus sign): Sign positive numbers. By default, printf only
                signs negative numbers.
width         A number specifying the minimum field width.
.precision    For floating-point numbers, specify the number of digits of precision
              to be output after the decimal point. For string conversion, precision
              specifies the number of characters to output.


Table 21-6 lists some examples of different formats in action. 


Table 21-6: printf Conversion Specification Examples

Argument     Format      Result       Notes
380          "%d"        380          Simple formatting of an integer.
380          "%#x"       0x17c        Integer formatted as a hexadecimal number
                                      using the “alternate format” flag.
380          "%05d"      00380        Integer formatted with leading zeros
                                      (padding) and a minimum field width of
                                      five characters.
380          "%05.5f"    380.00000    Number formatted as a floating-point
                                      number with padding and five decimal
                                      places of precision. Since the specified
                                      minimum field width (5) is less than the
                                      actual width of the formatted number, the
                                      padding has no effect.
380          "%010.5f"   0380.00000   By increasing the minimum field width to
                                      10, the padding is now visible.
380          "%+d"       +380         The + flag signs a positive number.
380          "%-d"       380          The - flag left-aligns the formatting.
abcdefghijk  "%5s"       abcdefghijk  A string formatted with a minimum field
                                      width.
abcdefghijk  "%.5s"      abcde        By applying precision to a string, it is
                                      truncated.
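
Padding and alignment are easier to see when each result is wrapped in visible delimiters. The following sketch prints several integer and string conversions between square brackets. The brackets and the field width of 6 are choices made for this illustration.

```shell
# Collect one bracketed result per line so the padding is visible.
results=$(
    printf '[%d]\n'   380
    printf '[%#x]\n'  380
    printf '[%05d]\n' 380
    printf '[%+d]\n'  380
    printf '[%-6d]\n' 380
    printf '[%6d]\n'  380
    printf '[%.5s]\n' abcdefghijk
)
printf '%s\n' "$results"
# [380]
# [0x17c]
# [00380]
# [+380]
# [380   ]
# [   380]
# [abcde]
```

The fifth and sixth lines show the - flag at work: with a field width of 6, the same number is left-aligned in one case and right-aligned in the other.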


Again, printf is used mostly in scripts where it is employed to format 
tabular data, rather than on the command line directly. But we can still 
show how it can be used to solve various formatting problems. First, let’s 
output some fields separated by tab characters. 


[me@linuxbox ~]$ printf "%s\t%s\t%s\n" str1 str2 str3 
str1 str2 str3 


By inserting \t (the escape sequence for a tab), we achieve the desired 
effect. Next, here are some numbers with neat formatting: 


[me@linuxbox ~]$ printf "Line: %05d %15.3f Result: %+15d\n" 1071 3.14156295 32589
Line: 01071           3.142 Result:          +32589


This shows the effect of minimum field width on the spacing of the 
fields. Or how about formatting a tiny web page? 


[me@linuxbox ~]$ printf "<html>\n\t<head>\n\t\t<title>%s</title>\n\t</head>\n\t<body>\n\t\t<p>%s</p>\n\t</body>\n</html>\n" "Page Title" "Page Content"
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <p>Page Content</p>
    </body>
</html>
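
Since printf is mostly used in scripts to format tabular data, here is a minimal sketch of that use. The format_row function and the column widths are hypothetical, invented for this illustration. The three sample rows follow the layout of our distros.txt file.

```shell
# format_row is a helper invented for this sketch: one printf call
# defines the column layout for every row of the table.
format_row() {
    printf '%-10s %6s  %s\n' "$1" "$2" "$3"
}

# Print a heading, then read whitespace-separated fields and format each row.
table=$(
    format_row Name Ver. Released
    while read -r name version released; do
        format_row "$name" "$version" "$released"
    done <<'EOF'
Fedora 10 11/25/2008
SUSE 11.0 06/19/2008
Ubuntu 8.04 04/24/2008
EOF
)
printf '%s\n' "$table"
```

Because the name column is left-aligned (%-10s) and the version column is right-aligned (%6s), the rows line up regardless of how long each field is, as long as it fits within its minimum width.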


So far, we have examined the simple text-formatting tools. These are good
for small, simple tasks, but what about larger jobs? One of the reasons that 
Unix became a popular operating system among technical and scientific 
users (aside from providing a powerful multitasking, multiuser environment 
for all kinds of software development) is that it offered tools that could be 
used to produce many types of documents, particularly scientific and aca- 
demic publications. In fact, as the GNU documentation describes, docu- 
ment preparation was instrumental to the development of Unix. 


The first version of UNIX was developed on a PDP-7 which was 
sitting around Bell Labs. In 1971 the developers wanted to get 
a PDP-11 for further work on the operating system. In order to 
justify the cost for this system, they proposed that they would 
implement a document formatting system for the AT&T patents 
division. This first formatting program was a reimplementation 
of McIllroy's roff, written by J. F. Ossanna.


Two main families of document formatters dominate the field: those 
descended from the original roff program, including nroff and troff, and 
those based on Donald Knuth’s TeX (pronounced “tek”) typesetting system.
And yes, the dropped E in the middle is part of its name.

The name roff is derived from the term run off as in, “I'll run off a
copy for you.” The nroff program is used to format documents for output
to devices that use monospaced fonts, such as character terminals
and typewriter-style printers. At the time of its introduction, this included 
nearly all printing devices attached to computers. The later troff pro- 
gram formats documents for output on typesetters, devices used to produce 
“camera-ready” type for commercial printing. Most computer printers today 
are able to simulate the output of typesetters. The roff family also includes 
some other programs that are used to prepare portions of documents. 
These include eqn (for mathematical equations) and tbl (for tables). 

The TeX system (in stable form) first appeared in 1989 and has, to
some degree, displaced troff as the tool of choice for typesetter output.
We won't be covering TeX here, both because of its complexity (there are
entire books about it) and because it is not installed by default on most
modern Linux systems.


For those interested in installing TeX, check out the texlive package, which can be
found in most distribution repositories, and the LyX graphical content editor.


groff 


groff is a suite of programs containing the GNU implementation of troff. 
It also includes a script that is used to emulate nroff and the rest of the roff 
family as well. 


While roff and its descendants are used to make formatted documents, 
they do it in a way that is rather foreign to modern users. Most documents 
today are produced using word processors that are able to perform both the 
composition and the layout of a document in a single step. Prior to the advent 
of the graphical word processor, documents were often produced in a two- 
step process involving the use of a text editor to perform composition, and a 
processor, such as troff, to apply the formatting. Instructions for the formatting
program were embedded into the composed text through the use of a
markup language. The modern analog for such a process is the web page, 
which is composed using a text editor of some kind and then rendered by a 
web browser using HTML as the markup language to describe the final page 
layout. 

We're not going to cover groff in its entirety, as many elements of its 
markup language deal with rather arcane details of typography. Instead, we 
will concentrate on one of its macro packages that remains in wide use. These 
macro packages condense many of its low-level commands into a smaller set 
of high-level commands that make using groff much easier. 

For a moment, let’s consider the humble man page. It lives in the /usr/ 
share/man directory as a gzip-compressed text file. If we were to examine its 
uncompressed contents, we would see the following (the man page for ls in
section 1 is shown):


[me@linuxbox ~]$ zcat /usr/share/man/man1/ls.1.gz | head
.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.47.3.
.TH LS "1" "January 2018" "GNU coreutils 8.28" "User Commands"
.SH NAME
ls \- list directory contents
.SH SYNOPSIS
.B ls
[\fI\,OPTION\/\fR]... [\fI\,FILE\/\fR]...
.SH DESCRIPTION
.\" Add any additional description here
.PP


Compared to the man page in its normal presentation, we can begin to 
see a correlation between the markup language and its results. 


[me@linuxbox ~]$ man ls | head
LS(1)                     User Commands                     LS(1)


NAME
       ls - list directory contents


SYNOPSIS
       ls [OPTION]... [FILE]...

The reason this is of interest is that man pages are rendered by groff,
using the mandoc macro package. In fact, we can simulate the man command 
with the following pipeline: 


[me@linuxbox ~]$ zcat /usr/share/man/man1/ls.1.gz | groff -mandoc -T ascii | head
LS(1)                     User Commands                     LS(1)


NAME
       ls - list directory contents


SYNOPSIS
       ls [OPTION]... [FILE]...


Here we use the groff program with the options set to specify the mandoc 
macro package and the output driver for ASCII. groff can produce output in 
several formats. If no format is specified, PostScript is output by default. 
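
To see how little markup a man page needs, we can write one ourselves. The following sketch creates the source for an imaginary command named hello and renders it the same way as above if groff is available. The command name, version string, and description are all invented for this illustration. Only the macros (.TH, .SH, .B, .PP) and the font escapes come from the ls page shown earlier.

```shell
# Write a tiny man page source for an imaginary command named "hello".
page=$(mktemp)
cat > "$page" <<'EOF'
.TH HELLO "1" "January 2018" "hello 1.0" "User Commands"
.SH NAME
hello \- print a friendly greeting
.SH SYNOPSIS
.B hello
[\fIOPTION\fR]...
.SH DESCRIPTION
.PP
Print a greeting on standard output.
EOF

# Render it for a terminal, but only if groff is installed on this system.
if command -v groff > /dev/null 2>&1; then
    groff -mandoc -T ascii "$page" | head -n 20
fi
```

The quoted heredoc delimiter ('EOF') keeps the shell from interpreting the backslashes, so the markup reaches the file exactly as typed.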


[me@linuxbox ~]$ zcat /usr/share/man/man1/ls.1.gz | groff -mandoc | head
%!PS-Adobe-3.0
%%Creator: groff version 1.18.1
%%CreationDate: Thu Feb 5 13:44:37 2009
%%DocumentNeededResources: font Times-Roman
%%+ font Times-Bold
%%+ font Times-Italic
%%DocumentSuppliedResources: procset grops 1.18 1
%%Pages: 4
%%PageOrder: Ascend
%%Orientation: Portrait


We briefly mentioned PostScript in the previous chapter and will 
again in the next chapter. PostScript is a page description language that is 
used to describe the contents of a printed page to a typesetter-like device. 
If we take the output of our command and store it to a file (assuming that 
we are using a graphical desktop with a Desktop directory), an icon for the 
output file should appear on the desktop. 


[me@linuxbox ~]$ zcat /usr/share/man/man1/ls.1.gz | groff -mandoc > ~/Desktop/ls.ps


By double-clicking the icon, a page viewer should start up and reveal
the file in its rendered form, as shown in Figure 21-1.

What we see is a nicely typeset man page for ls! In fact, it’s possible to
convert the PostScript file into a Portable Document Format (PDF) file with
this command:


[me@linuxbox ~]$ ps2pdf ~/Desktop/ls.ps ~/Desktop/ls.pdf


The ps2pdf program is part of the ghostscript package, which is installed 
on most Linux systems that support printing. 


Figure 21-1: Viewing PostScript output with a page viewer in GNOME


Linux systems often include many command line programs for file format conversion. 
They are often named using the convention of format2format. Try using the command 
ls /usr/bin/*[[:alpha:]]2[[:alpha:]]* to identify them. Also try searching 
for programs named formattoformat. 
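
To see what the bracket pattern in this note selects without depending on which programs happen to be installed, the same pattern can be applied to a handful of names. This is a minimal sketch: the list of names is made up for illustration, and looks_like_converter is a hypothetical helper, not a standard command.

```shell
# Hypothetical helper: succeeds when a name contains a letter, a 2, and
# another letter in a row, which is what *[[:alpha:]]2[[:alpha:]]* matches.
looks_like_converter() {
    case $1 in
        *[[:alpha:]]2[[:alpha:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

matches=""
for name in ps2pdf a2ps bzip2 e2fsck groff; do
    if looks_like_converter "$name"; then
        matches="$matches $name"
    fi
done

echo "matched:$matches"
```

The pattern works on spelling alone, so a name such as e2fsck, which is a filesystem checker rather than a converter, also gets through. The listing is a starting point for exploration, not a definitive catalog.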


For our last exercise with groff, we will revisit our old friend distros.txt. 
This time, we will use the tbl program, which is used to format tables, to 
typeset our list of Linux distributions. To do this, we are going to use our 
earlier sed script to add markup to a text stream that we will feed to groff. 

First, we need to modify our sed script to add the necessary markup ele- 
ments (called requests in groff) that tbl requires. Using a text editor, we will 
change distros.sed to the following: 


# sed script to produce Linux distributions report 

1 i\
.TS\
center box;\
cb s s\
cb cb cb\
l n c.\
Linux Distributions Report\
=\
Name	Version	Released\
_
s/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/
$ a\
.TE


Note that for the script to work properly, care must be taken to see 
that the words Name Version Released are separated by tabs, not spaces. We'll 
save the resulting file as distros-tbl.sed. tbl uses the .TS and .TE requests to start 
and end the table. The rows following the .TS request define global properties 
of the table, which, for our example, are centered horizontally on the page 
and surrounded by a box. The remaining lines of the definition describe the 
layout of each table row. Now, if we run our report-generating pipeline again 
with the new sed script, we’ll get the following: 


[me@linuxbox ~]$ sort -k 1,1 -k 2n distros.txt | sed -f distros-tbl.sed | groff -t -T ascii 


                 +------------------------------+
                 | Linux Distributions Report   |
                 +------------------------------+
                 | Name    Version    Released  |
                 +------------------------------+
                 |Fedora     5       2006-03-20 |
                 |Fedora     6       2006-10-24 |
                 |Fedora     7       2007-05-31 |
                 |Fedora     8       2007-11-08 |
                 |Fedora     9       2008-05-13 |
                 |Fedora    10       2008-11-25 |
                 |SUSE      10.1     2006-05-11 |
                 |SUSE      10.2     2006-12-07 |
                 |SUSE      10.3     2007-10-04 |
                 |SUSE      11.0     2008-06-19 |
                 |Ubuntu     6.06    2006-06-01 |
                 |Ubuntu     6.10    2006-10-26 |
                 |Ubuntu     7.04    2007-04-19 |
                 |Ubuntu     7.10    2007-10-18 |
                 |Ubuntu     8.04    2008-04-24 |
                 |Ubuntu     8.10    2008-10-30 |
                 +------------------------------+




Adding the -t option to groff instructs it to preprocess the text stream 
with tbl. Likewise, the -T option is used to output to ASCII rather than the 
default output medium, PostScript. 
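
The sorting and date-rewriting stages of this pipeline can be checked on their own, without groff, by feeding them a few records. The following is a minimal sketch. It assumes that distros.txt is tab-separated with dates written as MM/DD/YYYY, and the three sample records are taken from the report.

```shell
# Sort by name, then numerically by version, then rewrite the trailing
# MM/DD/YYYY date as YYYY-MM-DD, using the substitution from distros-tbl.sed.
report=$(printf 'SUSE\t10.2\t12/07/2006\nFedora\t10\t11/25/2008\nFedora\t5\t03/20/2006\n' |
    sort -k 1,1 -k 2n |
    sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)$/\3-\1-\2/')

printf '%s\n' "$report"
```

The numeric sort key is what places Fedora 5 ahead of Fedora 10. A plain text comparison of the version field would put 10 first.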

The format of the output is the best we can expect if we are limited 
to the capabilities of a terminal screen or typewriter-style printer. If we 
specify PostScript output and graphically view the output, we get a much 
more satisfying result, as shown in Figure 21-2. 


[me@linuxbox ~]$ sort -k 1,1 -k 2n distros.txt | sed -f distros-tbl.sed | 
groff -t > ~/Desktop/distros.ps 


Figure 21-2: Viewing the finished table 


Summing Up 


Given that text is so central to the character of Unix-like operating systems, 
it makes sense that there would be many tools that are used to manipulate 
and format text. As we have seen, there are! The simple formatting tools 
like fmt and pr will find many uses in scripts that produce short documents, 
while groff (and friends) can be used to write books. We may never write a 
technical paper using command line tools (though there are some people 
who do!), but it’s good to know that we could. 
PRINTING 


After spending the last couple of chapters manipulating text, it’s time to 
put that text on paper. In this chapter, we’ll look at the command line 
tools that are used to print files and control printer operation. We won’t 
be looking at how to configure printing because that varies from 
distribution to distribution and is usually set up automatically during 
installation. Note that we will need a working printer configuration to 
perform the exercises in this chapter. 

We will discuss the following commands: 

pr       Convert text files for printing 
lpr      Print files 
a2ps     Format files for printing on a PostScript printer 
lpstat   Show printer status information 
lpq      Show printer queue status 
lprm     Cancel print jobs 
A Brief History of Printing 


To fully understand the printing features found in Unix-like operating 
systems, we must first learn some history. Printing on Unix-like systems goes 
way back to the beginning of the operating system. In those days, printers 
and how they were used were much different from today. 


Printing in the Dim Times 


Like computers, printers in the pre-PC era tended to be large, expensive, 
and centralized. The typical computer user of 1980 worked at a terminal 
connected to a computer some distance away. The printer was located near 
the computer and was under the watchful eyes of the computer’s operators. 

When printers were expensive and centralized, as they often were in 
the early days of Unix, it was common practice for many users to share a 
printer. To identify print jobs belonging to a particular user, a banner page 
displaying the name of the user was often printed at the beginning of each 
print job. The computer support staff would then load up a cart containing 
the day’s print jobs and deliver them to the individual users. 


Character-Based Printers 


The printer technology of the 1980s was very different from today in two 
respects. First, printers of that period were almost always impact printers. 
Impact printers use a mechanism that strikes a ribbon against 
the paper to form character impressions on the page. Two of the popular 
technologies of that time were daisy-wheel printing and dot-matrix printing. 

The second, and more important, characteristic of early printers was 
that printers used a fixed set of characters that were intrinsic to the device. 
For example, a daisy-wheel printer could print only the characters actually 
molded into the petals of the daisy wheel. This made the printers much 
like high-speed typewriters. As with most typewriters, they printed using 
monospaced (fixed-width) fonts. This means that each character has the 
same width. Printing was done at fixed positions on the page, and the print- 
able area of a page contained a fixed number of characters. Most printers 
printed 10 characters per inch (CPI) horizontally and 6 lines per inch 
(LPI) vertically. Using this scheme, a US-letter sheet of paper is 85 charac- 
ters wide and 66 lines high. Taking into account a small margin on each 
side, 80 characters was considered the maximum width of a print line. This 
explains why terminal displays (and our terminal emulators) are normally 
80 characters wide. Using a monospaced font and an 80-character-wide 
terminal provides a what-you-see-is-what-you-get (WYSIWYG, pronounced 
“whizzy-wig”) view of printed output. 

Data is sent to a typewriter-like printer in a simple stream of bytes 
containing the characters to be printed. For example, to print an a, the 
ASCII character code 97 is sent. In addition, the low-numbered ASCII 
control codes provide a means of moving the printer’s carriage and 
paper, using codes for carriage return, line feed, form feed, and so on. 
Using the control codes, it’s possible to achieve some limited font effects, 
such as boldface, by having the printer print a character, backspace, and 
print the character again to get a darker print impression on the page. We 
can actually witness this if we use nroff to render a man page and examine 
the output using cat -A. 


[me@linuxbox ~]$ zcat /usr/share/man/man1/ls.1.gz | nroff -man | cat -A | head 
LS(1)                     User Commands                     LS(1) 
$ 
$ 
$ 
N^HNA^HAM^HME^HE$ 
       ls - list directory contents$ 
$ 
S^HSY^HYN^HNO^HOP^HPS^HSI^HIS^HS$ 
       l^Hls^Hs [_^HO_^HP_^HT_^HI_^HO_^HN]... [_^HF_^HI_^HL_^HE]...$ 


The ^H (CTRL-H) characters are the backspaces used to create the boldface 
effect. Likewise, we can also see a backspace/underscore sequence 
used to produce underlining. 
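
The overstrike trick can be reproduced by hand without a man page. The following is a minimal sketch, assuming GNU cat (for the -A option) is available. It builds the word NAME the way a typewriter-style printer would receive it in bold, shows the control characters, and then strips them again with sed.

```shell
# Bold: each letter, a backspace, then the same letter again.
bold=$(printf 'N\bNA\bAM\bME\bE\n')

# Underline: an underscore, a backspace, then the letter.
underlined=$(printf '_\bO_\bP_\bT\n')

# cat -A makes the backspaces visible as ^H and marks the line end with $.
visible_bold=$(printf '%s\n' "$bold" | cat -A)
visible_underlined=$(printf '%s\n' "$underlined" | cat -A)
echo "$visible_bold"
echo "$visible_underlined"

# Deleting every "character followed by backspace" pair leaves plain text.
backspace=$(printf '\b')
plain=$(printf '%s\n' "$bold" | sed "s/.$backspace//g")
echo "$plain"
```

On a terminal, the raw strings simply display as NAME and OPT because the terminal carries out each backspace and overwrites the character, which is why cat -A is needed to see what is really in the stream.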


Graphical Printers 


The development of GUIs led to major changes in printer technology. As 
computers moved to more picture-based displays, printing moved from 
character-based to graphical techniques. This was facilitated by the advent 
of the low-cost laser printer, which, instead of printing fixed characters, 
could print tiny dots anywhere in the printable area of the page. This made 
printing proportional fonts (like those used by typesetters), and even pho- 
tographs and high-quality diagrams, possible. 

However, moving from a character-based scheme to a graphical scheme 
presented a formidable technical challenge. Here’s why: the number of bytes 
needed to fill a page using a character-based printer can be calculated this 
way (assuming 60 lines per page each containing 80 characters): 


60 x 80 = 4,800 bytes 


In comparison, a 300 dot per inch (DPI) laser printer (assuming an 8- 
by 10-inch print area per page) requires this many bytes: 


(8 x 300) x (10 x 300) / 8 = 900,000 bytes 


Many of the slow PC networks simply could not handle the nearly 1MB 
of data required to print a full page on a laser printer, so it was clear that a 
clever invention was needed. 
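
The two page sizes can be worked out with shell arithmetic. This sketch uses only the figures given above (60 lines of 80 characters, and a 300 DPI bitmap of an 8- by 10-inch area at one bit per dot); the variable names are chosen for illustration.

```shell
# Character-based page: one byte per character position.
lines_per_page=60
chars_per_line=80
text_bytes=$(( lines_per_page * chars_per_line ))

# Bitmap page: (dots across) x (dots down), one bit per dot, 8 bits per byte.
dpi=300
bitmap_bytes=$(( (8 * dpi) * (10 * dpi) / 8 ))

echo "text page:   $text_bytes bytes"
echo "bitmap page: $bitmap_bytes bytes"
echo "ratio:       roughly $(( bitmap_bytes / text_bytes )) to 1"
```

The ratio is the point of the comparison: a graphical page needs well over a hundred times the data of a character page.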

That invention turned out to be the page description language (PDL). A 
page description language is a programming language that describes the 
contents of a page. Basically it says, “Go to this position, draw the character a 
in 10-point Helvetica, go to this position . . .” until everything on the page is 
described. The first major PDL was PostScript from Adobe Systems, which is 
still in wide use today. The PostScript language is a complete programming 
language tailored for typography and other kinds of graphics and imaging. It 
includes built-in support for 35 standard, high-quality fonts, plus the ability to 
accept additional font definitions at runtime. At first, support for PostScript 
was built into the printers themselves. This solved the data transmission 
problem. While the typical PostScript program was very verbose in compari- 
son to the simple byte stream of character-based printers, it was much smaller 
than the number of bytes required to represent the entire printed page. 

A PostScript printer accepted a PostScript program as input. The printer 
contained its own processor and memory (oftentimes making the printer 
a more powerful computer than the computer to which it was attached) 
and executed a special program called a PostScript interpreter, which read 
the incoming PostScript program and rendered the results into the printer’s 
internal memory, thus forming the pattern of bits (dots) that would be 
transferred to the paper. The generic name for a component that renders 
something into a large bit pattern (called a bitmap) is raster image processor (RIP). 

As the years went by, both computers and networks became much faster. 
This allowed the RIP to move from the printer to the host computer, which, 
in turn, permitted high-quality printers to be much less expensive. 

Many printers today still accept character-based streams, but many 
low-cost printers do not. They rely on the host computer’s RIP to provide a 
stream of bits to print as dots. There are still some PostScript printers, too. 


Printing with Linux 


Modern Linux systems employ two software suites to perform and manage 
printing. The first, the Common Unix Printing System (CUPS), provides 
print drivers and print job management, and the second, Ghostscript, a 
PostScript interpreter, acts as a RIP. 

CUPS manages printers by creating and maintaining print queues. As 
we discussed in the earlier history lesson, Unix printing was originally 
designed to manage a centralized printer shared by multiple users. Since 
printers are slow by nature, compared to the computers that are feeding 
them, printing systems need a way to schedule multiple print jobs and keep 
things organized. CUPS also has the ability to recognize different types of 
data (within reason) and can convert files to a printable form. 


Preparing Files for Printing 


As command line users, we are mostly interested in printing text, though it 
is certainly possible to print other data formats as well. 


pr—Convert Text Files for Printing 


We looked at pr a little in the previous chapter. Now we will examine some of 
its many options used in conjunction with printing. In our history of printing, 
we saw how character-based printers use monospaced fonts, resulting in fixed 
numbers of characters per line and lines per page. pr is used to adjust text to 
fit on a specific page size, with optional page headers and margins. Table 22-1 
summarizes its most commonly used options. 


Table 22-1: Common pr Options 

Option          Description 
+first[:last]   Output a range of pages starting with first and, optionally, 
                ending with last. 
-columns        Organize the content of the page into the number of columns 
                specified by columns. 
-a              By default, multicolumn output is listed vertically. By adding 
                the -a (across) option, content is listed horizontally. 
-d              Double-space output. 
-D format       Format the date displayed in page headers using format. See 
                the man page for the date command for a description of the 
                format string. 
-f              Use form feeds rather than newlines to separate pages. 
-h header       In the center portion of the page header, use header rather 
                than the name of the file being processed. 
-l length       Set page length to length. The default is 66 (US letter at 
                6 lines per inch). 
-n              Number lines. 
-o offset       Create a left margin offset characters wide. 
-w width        Set the page width to width. The default is 72. 


pr is often used in pipelines as a filter. In this example, we will produce 
a directory listing of /usr/bin and format it into paginated, three-column 
output using pr: 


[me@linuxbox ~]$ ls /usr/bin | pr -3 -w 65 | head 


2016-02-18 14:00                                           Page 1 


[                     apturl                bsd-write 
411toppm              ar                    bsh 
a2p                   arecord               btcflash 
a2ps                  arecordmidi           bug-buddy 
a2ps-lpr-wrapper      ark                   buildhash 


Sending a Print Job to a Printer 


The CUPS printing suite supports two methods of printing historically 
used on Unix-like systems. One method, called Berkeley or LPD (used in 
the Berkeley Software Distribution version of Unix), uses the lpr program, 
while the other method, called SysV (from the System V version of Unix), 
uses the lp program. Both programs do roughly the same thing. Choosing 
one over the other is a matter of personal taste. 
lpr—Print Files (Berkeley Style) 


The lpr program can be used to send files to the printer. It may also be 
used in pipelines, as it accepts standard input. For example, to print the 
results of our previous multicolumn directory listing, we could do this: 


[me@linuxbox ~]$ ls /usr/bin | pr -3 | lpr 


The report would be sent to the system’s default printer. To send the 
file to a different printer, the -P option can be used like this: 


lpr -P printer_name 


Here, printer_name is the name of the desired printer. To see a list of 
printers known to the system, use this: 


[me@linuxbox ~]$ lpstat -a 


TIP Many Linux distributions allow you to define a “printer” that outputs files to PDF, 
rather than printing on the physical printer. This is handy for experimenting with 
printing commands. Check your printer configuration program to see whether it sup- 
ports this configuration. On some distributions, you may need to install additional 
packages (such as cups-pdf) to enable this capability. 


Table 22-2 describes some of the common options for lpr. 


Table 22-2: Common lpr Options 

Option       Description 
-# number    Set number of copies to number. 
-p           Print each page with a shaded header with the date, time, job 
             name, and page number. This so-called pretty-print option can 
             be used when printing text files. 
-P printer   Specify the name of the printer used for output. If no printer 
             is specified, the system’s default printer is used. 
-r           Delete files after printing. This would be useful for programs 
             that produce temporary printer-output files. 


lp—Print Files (System V Style) 


Like lpr, lp accepts either files or standard input for printing. It differs from 
lpr in that it supports a different (and slightly more sophisticated) option 
set. Table 22-3 describes the common options. 
Table 22-3: Common lp Options 

Option                  Description 
-d printer              Set the destination (printer) to printer. If no -d 
                        option is specified, the system default printer is 
                        used. 
-n number               Set the number of copies to number. 
-o landscape            Set output to landscape orientation. 
-o fitplot              Scale the file to fit the page. This is useful when 
                        printing images, such as JPEG files. 
-o scaling=number       Scale file to number. The value of 100 fills the page. 
                        Values less than 100 are reduced, while values greater 
                        than 100 cause the file to be printed across multiple 
                        pages. 
-o cpi=number           Set the output characters per inch to number. The 
                        default is 10. 
-o lpi=number           Set the output lines per inch to number. The default 
                        is 6. 
-o page-bottom=points   Set the page margins. Values are expressed in points, 
-o page-left=points     a unit of typographic measurement. There are 72 points 
-o page-right=points    to an inch. 
-o page-top=points 
-P pages                Specify the list of pages. pages may be expressed as 
                        a comma-separated list and/or a range, for example, 
                        1,3,5,7-10. 


We'll produce our directory listing again, this time printing 12 CPI and 
8 LPI with a left margin of one-half inch. Note that we have to adjust the pr 
options to account for the new page size. 


[me@linuxbox ~]$ ls /usr/bin | pr -4 -w 90 -l 88 | lp -o page-left=36 -o cpi=12 -o lpi=8 


This pipeline produces a four-column listing using smaller type than 
the default. The increased number of characters per inch allows us to fit 
more columns on the page. 
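
The pr settings in that pipeline follow from the paper size and the lp options. Here is a minimal sketch of the arithmetic, working in points to avoid fractions. It assumes US-letter paper and a half-inch right margin to mirror the left one; the right margin is an assumption made here to arrive at 90 columns and is not set by the command above.

```shell
# Paper and margins in points (72 points to the inch).
page_width=612      # 8.5 inches, US letter
page_height=792     # 11 inches
left_margin=36      # matches -o page-left=36
right_margin=36     # assumed for this sketch
cpi=12              # matches -o cpi=12
lpi=8               # matches -o lpi=8

# Characters that fit between the margins, and lines that fit on the page.
width=$(( (page_width - left_margin - right_margin) * cpi / 72 ))
length=$(( page_height * lpi / 72 ))

echo "pr -4 -w $width -l $length"
```

Changing cpi or lpi in the lp options means recomputing both values, since pr has no way to know how the text will eventually be printed.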


Another Option: a2ps 


The a2ps program (available in most distribution repositories) is interest- 
ing. As we can surmise from its name, it’s a format conversion program, but 
it's also much more. Its name originally meant “ASCII to PostScript,” and it 
was used to prepare text files for printing on PostScript printers. Over the 
years, however, the capabilities of the program have grown, and now its name 
means “Anything to PostScript.” While its name suggests a format-conversion 
program, it is actually a printing program. It sends its default output to the 
system’s default printer rather than standard output. The program’s default 
behavior is that of a “pretty printer,” meaning that it improves the appearance 
of output. We can use the program to create a PostScript file on our desktop. 




[me@linuxbox ~]$ ls /usr/bin | pr -3 -t | a2ps -o ~/Desktop/ls.ps -L 66 
[stdin (plain): 11 pages on 6 sheets] 
[Total: 11 pages on 6 sheets] saved into the file `/home/me/Desktop/ls.ps' 


Here we filter the stream with pr, using the -t option (omit headers 
and footers), and then with a2ps, specifying an output file (-o option) and 
66 lines per page (-L option) to match the output pagination of pr. If we 
view the resulting file with a suitable file viewer, we will see the output in 
Figure 22-1. 


Figure 22-1: Viewing a2ps output 


As we can see, the default output layout is “two-up” format. This causes 
the contents of two pages to be printed on each sheet of paper. a2ps applies 
nice page headers and footers, too. 

a2ps has a lot of options. Table 22-4 provides a summary. 


Table 22-4: a2ps Options 

Option                    Description 
--center-title=text       Set center page title to text. 
--columns=number          Arrange pages into number columns. The default is 2. 
--footer=text             Set page footer to text. 
--guess                   Report the types of files given as arguments. Since 
                          a2ps tries to convert and format all types of data, 
                          this option can be useful for predicting what a2ps 
                          will do when given a particular file. 
--left-footer=text        Set the left-page footer to text. 
--left-title=text         Set the left-page title to text. 
--line-numbers=interval   Number lines of output every interval lines. 
--list=defaults           Display default settings. 
--pages=range             Print pages in range. 
--right-footer=text       Set the right-page footer to text. 
--right-title=text        Set the right-page title to text. 
--rows=number             Arrange pages into number rows. The default is 1. 
-B                        No page headers. 
-b text                   Set the page header to text. 
-f size                   Use size point font. 
-l number                 Set characters per line to number. This and the -L 
                          option (see the next entry) can be used to make 
                          files paginated with other programs, such as pr, 
                          fit correctly on the page. 
-L number                 Set lines per page to number. 
-M name                   Use a media name, such as A4. 
-n number                 Output number copies of each page. 
-o file                   Send output to file. If file is specified as -, use 
                          standard output. 
-P printer                Use printer. If a printer is not specified, the 
                          system default printer is used. 
-R                        Portrait orientation. 
-r                        Landscape orientation. 
-T number                 Set tab stops to every number characters. 
-u text                   Underlay (watermark) pages with text. 

This is just a summary. a2ps has several more options. 


There is another output formatter that is useful for converting text into PostScript. 
Called enscript, it can perform many of the same kinds of formatting and printing 
tricks, but unlike a2ps, it accepts only text input. 


Monitoring and Controlling Print Jobs 


As Unix printing systems are designed to handle multiple print jobs from 
multiple users, CUPS is designed to do the same. Each printer is given a 
print queue, where jobs are parked until they can be spooled to the printer. 
CUPS supplies several command line programs that are used to manage 
printer status and print queues. Like the lpr and lp programs, these man- 
agement programs are modeled after the corresponding programs from 
the Berkeley and System V printing systems. 
lpstat—Display Print System Status 


The lpstat program is useful for determining the names and availability of 
printers on the system. For example, if we had a system with both a physical 
printer (named printer) and a PDF virtual printer (named PDF), we could 
check their status like this: 


[me@linuxbox ~]$ lpstat -a 
PDF accepting requests since Mon 08 Dec 2017 03:05:59 PM EST 
printer accepting requests since Tue 24 Feb 2018 08:43:22 AM EST 


Further, we could determine a more detailed description of the print 
system configuration this way: 


[me@linuxbox ~]$ lpstat -s 
system default destination: printer 
device for PDF: cups-pdf:/ 
device for printer: ipp://print-server:631/printers/printer 


In this example, we see that printer is the system’s default printer and 
that it is a network printer using Internet Printing Protocol (ipp://) attached 
to a system named print-server. 

Table 22-5 describes some of the commonly useful options. 


Table 22-5: Common lpstat Options 

Option            Description 
-a [printer...]   Display the state of the printer queue for printer. Note 
                  that this is the status of the printer queue’s ability to 
                  accept jobs, not the status of the physical printers. If no 
                  printers are specified, all print queues are shown. 
-d                Display the name of the system’s default printer. 
-p [printer...]   Display the status of the specified printer. If no printers 
                  are specified, all printers are shown. 
-r                Display the status of the print server. 
-s                Display a status summary. 
-t                Display a complete status report. 


lpq—Display Printer Queue Status 


To see the status of a printer queue, the lpq program is used. This allows 
us to view the status of the queue and the print jobs it contains. Here is an 
example of an empty queue for a system default printer named printer: 


[me@linuxbox ~]$ lpq 
printer is ready 
no entries 
If we do not specify a printer (using the -P option), the system’s default 
printer is shown. If we send a job to the printer and then look at the queue, 
we will see it listed. 


[me@linuxbox ~]$ ls *.txt | pr -3 | lp 
request id is printer-603 (1 file(s)) 
[me@linuxbox ~]$ lpq 
printer is ready and printing 
Rank    Owner   Job     File(s)                         Total Size 
active  me      603     (stdin)                         1024 bytes 


lprm/cancel—Cancel Print Jobs 


CUPS supplies two programs used to terminate print jobs and remove them 
from the print queue. One is Berkeley style (lprm), and the other is System V 
(cancel). They differ slightly in the options they support but do basically the 
same thing. Using our earlier print job as an example, we could stop the job 
and remove it this way: 


[me@linuxbox ~]$ cancel 603 
[me@linuxbox ~]$ lpq 
printer is ready 
no entries 


Each command has options for removing all the jobs belonging to 
a particular user, particular printer, and multiple job numbers. Their 
respective man pages have all the details. 
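
Because lp reports the job number when it accepts a job, a script can save that number and hand it to cancel later. The following is a minimal sketch that works on the sample message from the lpq example rather than on a live print system. It assumes the message has exactly that form, and the variable names are chosen for illustration.

```shell
# Text in the form lp printed in the earlier example.
lp_output='request id is printer-603 (1 file(s))'

# Keep only the digits that follow the hyphen in the request id.
job_id=$(printf '%s\n' "$lp_output" |
    sed -n 's/^request id is .*-\([0-9][0-9]*\) .*$/\1/p')

# Show the command a script would run; nothing is canceled here.
echo "cancel $job_id"
```

In a real script, lp_output would be captured from the lp command itself, for example with command substitution around the printing pipeline.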


Summing Up 


In this chapter, we saw how the printers of the past influenced the design 
of the printing systems on Unix-like machines. We also explored how much 
control the command line gives us over not only the scheduling and 
execution of print jobs but also the various output options. 
COMPILING PROGRAMS 


In this chapter, we will look at how to build programs by compiling source 
code. The availability of source code is the essential freedom that makes 
Linux possible. The entire ecosystem of Linux development relies on free 
exchange between developers. For many desktop users, compiling is a lost 
art. It used to be quite common, but today, distribution providers maintain 
huge repositories of precompiled binaries, ready to download and use. At 
press time, the Debian repository (one of the largest of any of the 
distributions) contains more than 68,000 packages. 
So why compile software? There are two reasons. 


• Availability. Despite the number of precompiled programs in distribution 
  repositories, some distributions may not include all the desired 
  applications. In this case, the only way to get the desired program is to 
  compile it from source. 


• Timeliness. While some distributions specialize in cutting-edge versions 
  of programs, many do not. This means that to have the latest version of a 
  program, compiling is necessary. 


Compiling software from source code can become quite complex and 
technical and well beyond the reach of many users. However, many compiling 
tasks are easy and involve only a few steps. It all depends on the package. We 
will look at a very simple case to provide an overview of the process and as a 
starting point for those who want to undertake further study. 

We will introduce one new command. 


make Utility to maintain programs 


What Is Compiling? 
Simply put, compiling is the process of translating source code (the 
human-readable description of a program written by a programmer) into the native 
language of the computer’s processor. 

The computer’s processor (or CPU) works at an elemental level, execut- 
ing programs in what is called machine language. This is a numeric code 
that describes extremely small operations, such as “add this byte,” “point 
to this location in memory,” or “copy this byte.” Each of these instructions 
is expressed in binary (ones and zeros). The earliest computer programs 
were written using this numeric code, which may explain why programmers 
who wrote it were said to smoke a lot, drink gallons of coffee, and wear thick 
glasses. 

This problem was overcome by the advent of assembly language, which 
replaced the numeric codes with (slightly) easier-to-use character mnemonics 
such as CPY (for copy) and MOV (for move). Programs written in assembly 
language are processed into machine language by a program called an 
assembler. Assembly language is still used today for certain specialized pro- 
gramming tasks, such as device drivers and embedded systems. 

We next come to what are called high-level programming languages, which 
allow the programmer to be less concerned with the details of what the pro- 
cessor is doing and more with solving the problem at hand. The early ones 
(developed during the 1950s) include FORTRAN (designed for scientific 
and technical tasks) and COBOL (designed for business applications). Both 
are still in limited use today. 

While there are many popular programming languages, two predomi- 
nate. Most programs written for modern systems are written in either C or 
C++. In the examples to follow, we will be compiling a C program. 


Programs written in high-level programming languages are converted 
into machine language by processing them with another program, called a 
compiler. Some compilers translate high-level instructions into assembly lan- 
guage and then use an assembler to perform the final stage of translation 
into machine language. 

A process often used in conjunction with compiling is called linking. 
There are many common tasks performed by programs. Take, for instance, 
opening a file. Many programs perform this task, but it would be wasteful 
to have each program implement its own routine to open files. It makes 
more sense to have a single piece of programming that knows how to open 
files and to allow all programs that need it to share it. Providing support 
for common tasks is accomplished by what are called libraries. They con- 
tain multiple routines, each performing some common task that multiple 
programs can share. If we look in the /lib and /usr/lib directories, we can 
see where many of them live. A program called a linker is used to form the 
connections between the output of the compiler and the libraries that the 
compiled program requires. The final result of this process is the executable 
program file, ready for use. 
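
The compile and link stages can be seen separately on a very small program. The following is a minimal sketch, assuming gcc is installed; hello.c is a made-up one-function program written only for this illustration, and the work is done in a scratch directory that is removed at the end.

```shell
# Work in a scratch directory so nothing is left behind.
workdir=$(mktemp -d)

# A tiny C source file (hypothetical example program).
cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>

int main(void)
{
    puts("hello from a compiled program");
    return 0;
}
EOF

if command -v gcc >/dev/null 2>&1; then
    # Stage 1: compile only (-c), producing an object file.
    gcc -c "$workdir/hello.c" -o "$workdir/hello.o"
    # Stage 2: link the object file with the C library into an executable.
    gcc "$workdir/hello.o" -o "$workdir/hello"
    result=$("$workdir/hello")
else
    result="gcc not installed"
fi

echo "$result"
rm -rf "$workdir"
```

The puts routine is not defined in hello.c. It lives in the C library, and the second gcc invocation is where the linker connects the program to it.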


Are All Programs Compiled? 


No. As we have seen, there are programs such as shell scripts that do not 
require compiling. They are executed directly. These are written in what 
are known as scripting or interpreted languages. These languages have grown 
in popularity in recent years and include Perl, Python, PHP, Ruby, and 
many others. 

Scripted languages are executed by a special program called an 
interpreter. An interpreter inputs the program file and reads and executes 
each instruction contained within it. In general, interpreted programs 
execute much more slowly than compiled programs. This is because each 
source code instruction in an interpreted program is translated every 
time it is carried out, whereas with a compiled program, a source code 
instruction is translated only once, and this translation is permanently 
recorded in the final executable file. 

Why are interpreted languages so popular? For many programming 
chores, the results are “fast enough,” but the real advantage is that it is 
generally faster and easier to develop interpreted programs than compiled 
programs. Programs are usually developed in a repeating cycle of code, 
compile, test. As a program grows in size, the compilation phase of the cycle 
can become quite long. Interpreted languages remove the compilation step 
and thus speed up program development. 


Compiling a C Program 


Let’s compile something. Before we do that, however, we’re going to need 
some tools like the compiler, the linker, and make. The C compiler used 
almost universally in the Linux environment is called gcc (GNU C Compiler),
originally written by Richard Stallman. Most distributions do not install gcc
by default. We can check to see whether the compiler is present like this: 


[me@linuxbox ~]$ which gcc 
/usr/bin/gcc 


The results in this example indicate that the compiler is installed. 
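The same check can be applied to every tool the exercise needs. What follows is only a sketch: it uses the shell builtin command -v in place of which, and have_tool is a name made up for this illustration.

```shell
#!/bin/bash
# Sketch: report which build tools are present.
# have_tool is a hypothetical helper name, not a standard command.

have_tool() {
    # command -v prints the location of a command and succeeds only
    # if the shell can find it.
    command -v "$1" > /dev/null 2>&1
}

for tool in gcc make; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not installed"
    fi
done
```

If either tool is reported as not installed, the distribution's package manager is the place to get it.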


Your distribution may have a meta-package (a collection of packages) for software 
development. If so, consider installing it if you intend to compile programs on your 
system. If your system does not provide a meta-package, try installing the gcc and 
make packages. On many distributions, this is sufficient to carry out the following 
exercise. 


Obtaining the Source Code 


For our compiling exercise, we are going to compile a program from the 
GNU Project called diction. This handy little program checks text files for 
writing quality and style. As programs go, it is fairly small and easy to build. 
Following convention, we’re first going to create a directory for our 
source code named src and then download the source code into it using ftp. 


[me@linuxbox ~]$ mkdir src 
[me@linuxbox ~]$ cd src 

[me@linuxbox src]$ ftp ftp.gnu.org
Connected to ftp.gnu.org.
220 GNU FTP server ready.
Name (ftp.gnu.org:me): anonymous
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd gnu/diction
250 Directory successfully changed.
ftp> ls
200 PORT command successful. Consider using PASV.
150 Here comes the directory listing.
-rw-r--r-- 1 1003 65534 68940 Aug 28 1998 diction-0.7.tar.gz
-rw-r--r-- 1 1003 65534 90957 Mar 04 2002 diction-1.02.tar.gz
-rw-r--r-- 1 1003 65534 141062 Sep 17 2007 diction-1.11.tar.gz
226 Directory send OK.
ftp> get diction-1.11.tar.gz
local: diction-1.11.tar.gz remote: diction-1.11.tar.gz
200 PORT command successful. Consider using PASV.
150 Opening BINARY mode data connection for diction-1.11.tar.gz (141062
bytes).
226 File send OK.
141062 bytes received in 0.16 secs (847.4 kB/s)
ftp> bye
221 Goodbye.
[me@linuxbox src]$ ls
diction-1.11.tar.gz


While we used ftp in the previous example, which is traditional, there 
are other ways of downloading source code. For example, the GNU Project 
also supports downloading using HTTPS. We can download the diction 
source code using the wget program. 


[me@linuxbox src]$ wget https://ftp.gnu.org/gnu/diction/diction-1.11.tar.gz 
--2018-07-25 09:42:20-- https://ftp.gnu.org/gnu/diction/diction-1.11.tar.gz
Resolving ftp.gnu.org (ftp.gnu.org)... 208.118.235.20, 2001:4830:134:3::b
Connecting to ftp.gnu.org (ftp.gnu.org)|208.118.235.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 141062 (138K) [application/x-gzip]
Saving to: 'diction-1.11.tar.gz'

diction-1.11.tar.gz 100%[===================>] 137.76K --.-KB/s in 0.09s

2018-07-25 09:42:20 (1.43 MB/s) - 'diction-1.11.tar.gz.1' saved [141062/141062]


Because we are the “maintainer” of this source code while we compile it, we will keep 
it in ~/src. Source code installed by your distribution will be installed in /usr/src, 
while source code we maintain that’s intended for use by multiple users is usually 
installed in /usr/local/src. 


As we can see, source code is usually supplied in the form of a com- 
pressed tar file. Sometimes called a tarball, this file contains the source tree, 
or hierarchy of directories and files that comprise the source code. After 
arriving at the FTP site, we examine the list of tar files available and select 
the newest version for download. Using the get command within ftp, we 
copy the file from the FTP server to the local machine. 

Once the tar file is downloaded, it must be unpacked. This is done with 
the tar program. 


[me@linuxbox src]$ tar xzf diction-1.11.tar.gz 
[me@linuxbox src]$ ls
diction-1.11 diction-1.11.tar.gz 


The diction program, like all GNU Project software, follows certain standards for 
source code packaging. Most other source code available in the Linux ecosystem also 
follows this standard. One element of the standard is that when the source code tar 
file is unpacked, a directory will be created that contains the source tree, and this 
directory will be named project-x.xx, thus containing both the project’s name and 
its version number. This scheme allows easy installation of multiple versions of the
same program. However, it is often a good idea to examine the layout of the tree before 
unpacking it. Some projects will not create the directory but instead will deliver the 
files directly into the current directory. This will make a mess in our otherwise well- 
organized src directory. To avoid this, use the following command to examine the 
contents of the tar file: 


tar tzvf tarfile | head 
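For a long archive, the first few lines of the listing may not tell the whole story. As a rough sketch, the listing can be reduced to its distinct top-level names. The helper name tar_top_levels is made up for this illustration, and the sketch assumes a gzip-compressed archive.

```shell
#!/bin/bash
# Sketch: list the distinct top-level names inside a gzip-compressed tar file.
# tar_top_levels is a hypothetical helper name.

tar_top_levels() {
    # t = list contents, z = gunzip, f = read from the named file.
    # cut keeps the text before the first slash; sort -u removes repeats.
    tar tzf "$1" | cut -d/ -f1 | sort -u
}

# Intended use (not executed here):
#   tar_top_levels diction-1.11.tar.gz
```

A single line of output means everything in the archive lives under one directory. Several lines mean the files would be unpacked directly into the current directory. Archives whose member names begin with ./ will show a lone dot instead, so the result still deserves a quick look.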


Examining the Source Tree


Unpacking the tar file results in the creation of a new directory, named
diction-1.11. This directory contains the source tree. Let’s look inside.


[me@linuxbox src]$ cd diction-1.11 
[me@linuxbox diction-1.11]$ ls
config.guess  diction.c        getopt.c      nl
config.h.in   diction.pot      getopt.h      nl.po
config.sub    diction.spec     getopt_int.h  README
configure     diction.spec.in  INSTALL       sentence.c
configure.in  diction.texi.in  install-sh    sentence.h
COPYING       en               Makefile.in   style.1.in
de            en_GB            misc.c        style.c
de.po         en_GB.po         misc.h        test
diction.1.in  getopt1.c        NEWS


In it, we see a number of files. Programs belonging to the GNU Project, 
and many others, will supply the documentation files README, INSTALL, 
NEWS, and COPYING. These files contain the description of the program, 
information on how to build and install it, and its licensing terms. It is always 
a good idea to carefully read the README and INSTALL files before attempt- 
ing to build the program. 

The other interesting files in this directory are the ones ending with
.c and .h.


[me@linuxbox diction-1.11]$ ls *.c
diction.c getopt1.c getopt.c misc.c sentence.c style.c
[me@linuxbox diction-1.11]$ ls *.h
getopt.h getopt_int.h misc.h sentence.h


The .c files contain the two C programs supplied by the package (style 
and diction), divided into modules. It is common practice for large programs 
to be broken into smaller, easier-to-manage pieces. The source code files are 
ordinary text and can be examined with less. 


[me@linuxbox diction-1.11]$ less diction.c 


The .h files are known as header files. These, too, are ordinary text.
Header files contain descriptions of the routines included in a source 
code file or library. For the compiler to connect the modules, it must 
receive a description of all the modules needed to complete the entire 
program. Near the beginning of the diction.c file, we see this line: 


#include "getopt.h" 


This instructs the compiler to read the file getopt.h as it reads the source 
code in diction.c to “know” what’s in getopt.c. The getopt.c file supplies routines 
that are shared by both the style and diction programs. 


Before the include statement for getopt.h, we see some other include 
statements such as these: 


#include <regex.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <unistd.h> 


These also refer to header files, but they refer to header files that live out- 
side the current source tree. They are supplied by the system to support the 
compilation of every program. If we look in /usr/include, we can see them. 


[me@linuxbox diction-1.11]$ ls /usr/include


The header files in this directory were installed when we installed the 
compiler. 


Building the Program 


Most programs build with a simple, two-command sequence. 


./configure 
make 


The configure program is a shell script that is supplied with the source 
tree. Its job is to analyze the build environment. Most source code is designed
to be portable. That is, it is designed to build on more than one kind of Unix- 
like system. But to do that, the source code may need to undergo slight 
adjustments during the build to accommodate differences between systems. 
configure also checks to see that necessary external tools and components are 
installed. 

Let’s run configure. Because configure is not located where the shell 
normally expects programs to be located, we must explicitly tell the shell 
its location by prefixing the command with ./ to indicate that the program 
is located in the current working directory. 


[me@linuxbox diction-1.11]$ ./configure 


configure will output a lot of messages as it tests and configures the 
build. When it finishes, it will look something like this: 


checking libintl.h presence... yes
checking for libintl.h... yes
checking for library containing gettext... none required
configure: creating ./config.status
config.status: creating Makefile
config.status: creating diction.1
config.status: creating diction.texi
config.status: creating diction.spec
config.status: creating style.1
config.status: creating test/rundiction
config.status: creating config.h
[me@linuxbox diction-1.11]$


What’s important here is that there are no error messages. If there 
were, the configuration failed, and the program will not build until the 
errors are corrected. 
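Because a failed configuration makes the rest of the build pointless, it can help to stop at the first step that fails. The following is a minimal sketch of that idea. It assumes that each command signals failure through its exit status, and run_step is a name invented for this example.

```shell
#!/bin/bash
# Sketch: stop the build as soon as one step fails.
# run_step is a hypothetical helper name.

run_step() {
    echo "running: $*"
    # "$@" runs the command given as arguments; the block after ||
    # runs only when that command reports failure.
    "$@" || {
        echo "failed: $*" >&2
        return 1
    }
}

# Intended use inside an unpacked source tree (not executed here):
#   run_step ./configure && run_step make
```

With the && operator between the two steps, make is attempted only if configure finished without reporting a failure.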

We see configure created several new files in our source directory. The 
most important one is the makefile. The makefile is a configuration file that 
instructs the make program exactly how to build the program. Without it, make 
will refuse to run. The makefile is an ordinary text file, so we can view it. 


[me@linuxbox diction-1.11]$ less Makefile 


The make program takes as input a makefile (which is normally named 
Makefile), which describes the relationships and dependencies among the 
components that comprise the finished program. 

The first part of the makefile defines variables that are substituted in 
later sections of the makefile. For example we see the following line: 


CC= gcc 


This defines the C compiler to be gcc. Later in the makefile, we see one 
instance where it gets used. 


diction: diction.o sentence.o misc.o getopt.o getopt1.o
	$(CC) -o $@ $(LDFLAGS) diction.o sentence.o misc.o \
	getopt.o getopt1.o $(LIBS)


A substitution is performed here, and the value $(CC) is replaced by gcc 
at runtime. 

Most of the makefile consists of lines that define a target, in this case the
executable file diction, and the files on which it is dependent. The remaining
lines describe the commands needed to create the target from its components.
We see in this example that the executable file diction (one of the end
products) depends on the existence of diction.o, sentence.o, misc.o, getopt.o, and
getopt1.o. Later, in the makefile, we see definitions of each of these as targets.


diction.o: diction.c config.h getopt.h misc.h sentence.h
getopt.o: getopt.c getopt.h getopt_int.h
getopt1.o: getopt1.c getopt.h getopt_int.h
misc.o: misc.c config.h misc.h
sentence.o: sentence.c config.h misc.h sentence.h
style.o: style.c config.h getopt.h misc.h sentence.h


However, we don’t see any command specified for them. This is handled 
by a general target, earlier in the file, that describes the command used to 
compile any .c file into an .o file. 


.c.o:
	$(CC) -c $(CPPFLAGS) $(CFLAGS) $<
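To see the same ingredients on a much smaller scale, here is a sketch that writes a three-line makefile containing one variable, one target, and one dependency. None of it comes from the diction source tree: the file names greeting.txt and name.txt, the variable GREETING, and the function name write_demo_makefile are all invented for this illustration.

```shell
#!/bin/bash
# Sketch: write a tiny makefile with a variable, a target, and a dependency.
# The file names greeting.txt and name.txt are made up for this example.

write_demo_makefile() {
    dir=$1
    echo 'world' > "$dir/name.txt"
    # printf is used so that the command line under the target begins
    # with a real tab character (\t), which make requires.
    printf '%s\n\n%s\n\t%s\n' \
        'GREETING = hello' \
        'greeting.txt: name.txt' \
        'echo $(GREETING) `cat name.txt` > $@' \
        > "$dir/Makefile"
}

# Intended use in an empty scratch directory (not executed here):
#   write_demo_makefile .
#   make
```

When make runs in that directory, it substitutes hello for $(GREETING) and the target name greeting.txt for $@, so the resulting file should contain the text hello world.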


This all seems very complicated. Why not simply list all the steps to 
compile the parts and be done with it? The answer to this will become clear 
in a moment. In the meantime, let’s run make and build our programs. 


[me@linuxbox diction-1.11]$ make 


The make program will run, using the contents of Makefile to guide its 
actions. It will produce a lot of messages. 

When it finishes, we will see that all the targets are now present in our 
directory. 


[me@linuxbox diction-1.11]$ ls
config.guess   de.po            en            install-sh   sentence.c
config.h       diction          en_GB         Makefile     sentence.h
config.h.in    diction.1        en_GB.mo      Makefile.in  sentence.o
config.log     diction.1.in     en_GB.po      misc.c       style
config.status  diction.c        getopt1.c     misc.h       style.1
config.sub     diction.o        getopt1.o     misc.o       style.1.in
configure      diction.pot      getopt.c      NEWS         style.c
configure.in   diction.spec     getopt.h      nl           style.o
COPYING        diction.spec.in  getopt_int.h  nl.mo        test
de             diction.texi     getopt.o      nl.po
de.mo          diction.texi.in  INSTALL       README


Among the files, we see diction and style, the programs that we set out 
to build. Congratulations are in order! We just compiled our first programs 
from source code! 

But just out of curiosity, let’s run make again. 


[me@linuxbox diction-1.11]$ make 
make: Nothing to be done for `all'.


It only produces this strange message. What’s going on? Why didn’t it 
build the program again? Ah, this is the magic of make. Rather than simply 
building everything again, make only builds what needs building. With all of 
the targets present, make determined that there was nothing to do. We can 
demonstrate this by deleting one of the targets and running make again to 
see what it does. Let’s get rid of one of the intermediate targets. 


[me@linuxbox diction-1.11]$ rm getopt.o 
[me@linuxbox diction-1.11]$ make 


We see that make rebuilds it and relinks the diction and style programs 
because they depend on the missing module. This behavior also points out 
another important feature of make: it keeps targets up-to-date. make insists
that targets be newer than their dependencies. This makes perfect sense
because a programmer will often update a bit of source code and then 
use make to build a new version of the finished product. make ensures that 
everything that needs building based on the updated code is built. If we 
use the touch program to “update” one of the source code files, we can see 
this happen: 


[me@linuxbox diction-1.11]$ ls -l diction getopt.c
-rwxr-xr-x 1 me me 37164 2009-03-05 06:14 diction
-rw-r--r-- 1 me me 33125 2007-03-30 17:45 getopt.c
[me@linuxbox diction-1.11]$ touch getopt.c
[me@linuxbox diction-1.11]$ ls -l diction getopt.c
-rwxr-xr-x 1 me me 37164 2009-03-05 06:14 diction
-rw-r--r-- 1 me me 33125 2009-03-05 06:23 getopt.c
[me@linuxbox diction-1.11]$ make


After make runs, we see that it has restored the target to being newer 
than the dependency. 


[me@linuxbox diction-1.11]$ ls -l diction getopt.c
-rwxr-xr-x 1 me me 37164 2009-03-05 06:24 diction
-rw-r--r-- 1 me me 33125 2009-03-05 06:23 getopt.c


The ability of make to intelligently build only what needs building is a 
great benefit to programmers. While the time savings may not be apparent 
with our small project, it is very significant with larger projects. Remember, 
the Linux kernel (a program that undergoes continuous modification and 
improvement) contains several million lines of code. 
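The timestamp rule itself is simple enough to sketch in the shell. The function below is only an illustration of the comparison described above, not how make is implemented, and the name needs_rebuild is made up. It succeeds when a target is missing or when any listed dependency is newer than the target.

```shell
#!/bin/bash
# Sketch of the timestamp rule make applies to a single target.
# needs_rebuild is a hypothetical helper; real make does much more.

needs_rebuild() {
    target=$1
    shift
    # A missing target always has to be built.
    [ -e "$target" ] || return 0
    for dep in "$@"; do
        # -nt is true when the file on the left is newer than the one
        # on the right.
        if [ "$dep" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}

# Intended use (not executed here):
#   needs_rebuild diction getopt.c && echo "diction is out of date"
```

Right after the touch getopt.c step shown earlier, such a check would report that diction is out of date. After make runs, it would not.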


Installing the Program 


Well-packaged source code will often include a special make target called 
install. This target will install the final product in a system directory for use. 
Usually, this directory is /usr/local/bin, the traditional location for locally built 
software. However, this directory is not normally writable by ordinary users, 
so we must become the superuser to perform the installation. 


[me@linuxbox diction-1.11]$ sudo make install 


After we perform the installation, we can check that the program is 
ready to go. 


[me@linuxbox diction-1.11]$ which diction 
/usr/local/bin/diction
[me@linuxbox diction-1.11]$ man diction 


There we have it! 


Summing Up 


In this chapter, we saw how three simple commands—./configure, make, 
and make install—can be used to build many source code packages. We 
also saw the important role that make plays in the maintenance of pro- 
grams. The make program can be used for any task that needs to maintain 
a target/dependency relationship, not just for compiling source code.


WRITING SHELL SCRIPTS


WRITING YOUR FIRST SCRIPT 


In the preceding chapters, we assembled an arsenal of command line tools.
While these tools can solve many kinds of computing problems, we are still
limited to manually using them one by one on the command line. Wouldn’t
it be great if we could get the shell to do more of the work? We can! By
joining our tools together into programs of our own design, the shell can
carry out complex sequences of tasks all by itself. We can enable it to do
this by writing shell scripts.


What Are Shell Scripts? 


In the simplest terms, a shell script is a file containing a series of commands. 
The shell reads this file and carries out the commands as though they have 
been entered directly on the command line. 

The shell is somewhat unique, in that it is both a powerful command 
line interface to the system and a scripting language interpreter. As we will
see, most of the things that can be done on the command line can be done
in scripts, and most of the things that can be done in scripts can be done 
on the command line. 

We have covered many shell features, but we have focused on those fea- 
tures most often used directly on the command line. The shell also provides 
a set of features usually (but not always) used when writing programs. 


How to Write a Shell Script


To successfully create and run a shell script, we need to do three things.


1. Write a script. Shell scripts are ordinary text files. So, we need a text 
editor to write them. The best text editors will provide syntax highlighting, 
allowing us to see a color-coded view of the elements of the script. Syntax 
highlighting will help us spot certain kinds of common errors. vim, gedit,
kate, and many other editors are good candidates for writing scripts. 


2. Make the script executable. The system is rather fussy about not letting 
any old text file be treated as a program, and for good reason! We need 
to set the script file’s permissions to allow execution. 


3. Put the script somewhere the shell can find it. The shell automatically 
searches certain directories for executable files when no explicit path- 
name is specified. For maximum convenience, we will place our scripts 
in these directories. 


Script File Format 


In keeping with programming tradition, we’ll create a “Hello World” pro- 
gram to demonstrate an extremely simple script. Let’s fire up our text editors 
and enter the following script: 


#!/bin/bash 
# This is our first script. 


echo 'Hello World!'


The last line of our script is pretty familiar; it’s just an echo command 
with a string argument. The second line is also familiar. It looks like a 
comment that we have seen used in many of the configuration files we 
have examined and edited. One thing about comments in shell scripts is 
that they may also appear at the ends of lines provided they are preceded 
by at least one whitespace character, like so: 


echo 'Hello World!' # This is a comment too 


Everything from the # symbol onward on the line is ignored. 
Like many things, this works on the command line, too. 


[me@linuxbox ~]$ echo 'Hello World!' # This is a comment too
Hello World! 


Though comments are of little use on the command line, they will work. 

The first line of our script is a little mysterious. It looks as if it should be 
a comment since it starts with #, but it looks too purposeful to be just that. 
The #! character sequence is, in fact, a special construct called a shebang. The 
shebang is used to tell the kernel the name of the interpreter that should be 
used to execute the script that follows. Every shell script should include this as 
its first line. 
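We can inspect the shebang of any script by reading its first line. The following sketch does just that. The function name interpreter_of is invented for this example, and the sketch assumes the shebang is written without a space after the #! characters.

```shell
#!/bin/bash
# Sketch: show which interpreter a script asks for.
# interpreter_of is a hypothetical helper name.

interpreter_of() {
    # Read only the first line of the file.
    IFS= read -r first_line < "$1"
    case $first_line in
        '#!'*)
            # Remove the first two characters (the #! itself).
            printf '%s\n' "${first_line#??}"
            ;;
        *)
            echo "no shebang"
            ;;
    esac
}

# Intended use (not executed here):
#   interpreter_of hello_world
```

For the script we just wrote, the first line is #!/bin/bash, so the function would print /bin/bash.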

Let’s save our script file as hello_world. 


Executable Permissions 


The next thing we have to do is make our script executable. This is easily 
done using chmod. 


[me@linuxbox ~]$ ls -l hello_world
-rw-r--r-- 1 me me 63 2018-03-07 10:10 hello_world
[me@linuxbox ~]$ chmod 755 hello_world
[me@linuxbox ~]$ ls -l hello_world
-rwxr-xr-x 1 me me 63 2018-03-07 10:10 hello_world


There are two common permission settings for scripts: 755 for scripts 
that everyone can execute and 700 for scripts that only the owner can exe- 
cute. Note that scripts must be readable to be executed. 
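Both requirements can be checked from the shell before we try to run a script. This is a small sketch only. The name can_run_script is made up, and the check reflects the permissions of whoever runs it.

```shell
#!/bin/bash
# Sketch: check the two permissions a script needs before it can run.
# can_run_script is a hypothetical helper name.

can_run_script() {
    # -r: readable by the current user; -x: executable by the current user.
    [ -r "$1" ] && [ -x "$1" ]
}

# Intended use (not executed here):
#   can_run_script hello_world && echo "ready to run"
```

A file with mode 644 fails the check because no execute permission is set, while the same file after chmod 755 passes it.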


Script File Location 


With the permissions set, we can now execute our script. 


[me@linuxbox ~]$ ./hello_world
Hello World! 


For the script to run, we must precede the script name with an explicit 
path. If we don’t, we get this: 


[me@linuxbox ~]$ hello_world 
bash: hello_world: command not found


Why is this? What makes our script different from other programs? 
As it turns out, nothing. Our script is fine. Its location is the problem. 
In Chapter 11, we discussed the PATH environment variable and its effect 
on how the system searches for executable programs. To recap, the sys- 
tem searches a list of directories each time it needs to find an executable 
program, if no explicit path is specified. This is how the system knows to 
execute /bin/ls when we type ls at the command line. The /bin directory
is one of the directories that the system automatically searches. The list of 
directories is held within an environment variable named PATH. The PATH
variable contains a colon-separated list of directories to be searched. We
can view the contents of PATH. 


[me@linuxbox ~]$ echo $PATH 
/home/me/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games


Here we see our list of directories. If our script were located in any of 
the directories in the list, our problem would be solved. Notice the first 
directory in the list, /home/me/bin. Most Linux distributions configure the 
PATH variable to contain a bin directory in the user’s home directory to allow 
users to execute their own programs. So, if we create the bin directory and 
place our script within it, it should start to work like other programs. 


[me@linuxbox ~]$ mkdir bin 
[me@linuxbox ~]$ mv hello_world bin
[me@linuxbox ~]$ hello_world
Hello World!


And so it does. 
If the PATH variable does not contain the directory, we can easily add it 
by including this line in our .bashrc file:


export PATH=~/bin:"$PATH"


After this change is made, it will take effect in each new terminal session. 
To apply the change to the current terminal session, we must have the shell 
reread the .bashrc file. This can be done by “sourcing” it. 


[me@linuxbox ~]$ . .bashrc


The dot (.) command is a synonym for the source command, a shell 
builtin that reads a specified file of shell commands and treats it like input 
from the keyboard. 
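One side effect of sourcing a startup file more than once is that a plain assignment adds the same directory to PATH again each time. A possible refinement, offered here only as a sketch, is to add the directory only when it is not already listed. The function name path_prepend is made up for this example.

```shell
#!/bin/bash
# Sketch: add a directory to the front of PATH only if it is not
# already listed. path_prepend is a hypothetical helper name.

path_prepend() {
    # Wrapping both sides in colons lets us match a whole entry,
    # including the first and last ones in the list.
    case ":$PATH:" in
        *":$1:"*)
            # Already present; do nothing.
            ;;
        *)
            PATH="$1:$PATH"
            ;;
    esac
}

path_prepend "$HOME/bin"
export PATH
```

Placed in a startup file, these lines can be sourced repeatedly without making the PATH variable grow.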


Ubuntu (and most other Debian-based distributions) automatically adds the ~/bin 
directory to the PATH variable if the ~/bin directory exists when the user’s .bashrc file 
is executed. So, on Ubuntu systems, if we create the ~/bin directory and then log out
and log in again, everything works. 


Good Locations for Scripts 


The ~/bin directory is a good place to put scripts intended for personal 
use. If we write a script that everyone on a system is allowed to use, the 
traditional location is /usr/local/bin. Scripts intended for use by the system 
administrator are often located in /usr/local/sbin. In most cases, locally sup- 
plied software, whether scripts or compiled programs, should be placed in 
the /usr/local hierarchy and not in /bin or /usr/bin. These directories are 
specified by the Linux Filesystem Hierarchy Standard to contain only files 
supplied and maintained by the Linux distributor. 


More Formatting Tricks 


One of the key goals of serious script writing is ease of maintenance; that 
is, the ease with which a script may be modified by its author or others to 
adapt it to changing needs. Making a script easy to read and understand is 
one way to facilitate easy maintenance. 


Long Option Names 


Many of the commands we have studied feature both short and long 
option names. For instance, the ls command has many options that can
be expressed in either short or long form. For example, the following: 


[me@linuxbox ~]$ ls -ad 


is equivalent to this: 


[me@linuxbox ~]$ ls --all --directory


In the interests of reduced typing, short options are preferred when 
entering options on the command line, but when writing scripts, long 
options can provide improved readability. 


Indentation and Line Continuation 


When employing long commands, readability can be enhanced by spreading 
the command over several lines. In Chapter 17, we looked at a particularly 
long example of the find command. 


[me@linuxbox ~]$ find playground \( -type f -not -perm 0600 -exec 
chmod 0600 '{}' ';' \) -or \( -type d -not -perm 0700 -exec chmod 
0700 '{}' ';' \) 


Obviously, this command is a little hard to figure out at first glance. In 
a script, this command might be easier to understand if written this way: 


find playground \
    \( \
        -type f \
        -not -perm 0600 \
        -exec chmod 0600 '{}' ';' \
    \) \
    -or \
    \( \
        -type d \
        -not -perm 0700 \
        -exec chmod 0700 '{}' ';' \
    \)


By using line continuations (backslash-linefeed sequences) and inden- 
tation, the logic of this complex command is more clearly described to the 
reader. This technique works on the command line, though it is seldom used,
as it is awkward to type and edit. One difference between a script and a command
line is that the script may employ tab characters to achieve indentation,
whereas the command line cannot since tabs are used to activate completion.


CONFIGURING VIM FOR SCRIPT WRITING 


The vim text editor has many, many configuration settings. There are several 
common options that can facilitate script writing. 
The following turns on syntax highlighting: 


:syntax on 


With this setting, different elements of shell syntax will be displayed in dif- 
ferent colors when viewing a script. This is helpful for identifying certain kinds 
of programming errors. It looks cool, too. Note that for this feature to work, you 
must have a complete version of vim installed, and the file you are editing must 
have a shebang indicating the file is a shell script. If you have difficulty with the 
previous command, try :set syntax=sh instead. 

This turns on the option to highlight search results: 


:set hlsearch 


Say we search for the word echo. With this option on, each instance of 
the word will be highlighted. 
The following sets the number of columns occupied by a tab character: 


:set tabstop=4 


The default is eight columns. Setting the value to 4 (which is a common 
practice) allows long lines to fit more easily on the screen. 
The following turns on the “auto indent” feature: 


:set autoindent


This causes vim to indent a new line the same amount as the line just typed.
This speeds up typing on many kinds of programming constructs. To stop
indentation, press CTRL-D.

These changes can be made permanent by adding these commands (without
the leading colon characters) to your ~/.vimrc file.
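As a sketch of what that looks like in practice, the following appends the four settings from this section to a configuration file. The function name add_vim_settings is invented for this example, and it assumes we really do want all four settings. The file is only appended to, never overwritten.

```shell
#!/bin/bash
# Sketch: append the four settings discussed above to a vim
# configuration file. add_vim_settings is a hypothetical helper name.

add_vim_settings() {
    # Each argument to printf becomes one line in the file.
    # >> appends to the file instead of replacing it.
    printf '%s\n' \
        'syntax on' \
        'set hlsearch' \
        'set tabstop=4' \
        'set autoindent' >> "$1"
}

# Intended use (not executed here):
#   add_vim_settings ~/.vimrc
```

Running it twice would add the lines twice, so it is meant to be run once or replaced by editing the file by hand.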


Summing Up 


In this first chapter of scripting, we looked at how scripts are written and 
made to easily execute on our system. We also saw how we can use various 
formatting techniques to improve the readability (and thus the maintain- 
ability) of our scripts. In future chapters, ease of maintenance will come up 
again and again as a central principle in good script writing.


STARTING A PROJECT


Starting with this chapter, we will begin to build a program. The purpose of
this project is to see how various shell features are used to create programs
and, more importantly, create good programs.


The program we will write is a report generator. It will present various 
statistics about our system and its status and will produce this report in 
HTML format so we can view it with a web browser such as Firefox or 
Chrome. 

Programs are usually built up in a series of stages, with each stage 
adding features and capabilities. The first stage of our program will pro- 
duce a minimal HTML document that contains no system information. 
That will come later.

The first thing we need to know is the format of a well-formed HTML document.
It looks like this:


<html> 
<head> 
<title>Page Title</title> 
</head> 
<body> 
Page body. 
</body> 
</html> 


If we enter this into our text editor and save the file as foo.html, we can use
the following URL in Firefox to view the file: file:///home/username/foo.html.

The first stage of our program will be able to output this HTML file to 
standard output. We can write a program to do this pretty easily. Let’s start 
our text editor and create a new file named ~/bin/sys_info_page. 


[me@linuxbox ~]$ vim ~/bin/sys_info_page 


Enter the following program: 


#!/bin/bash

# Program to output a system information page

echo "<html>"
echo " <head>"
echo " <title>Page Title</title>"
echo " </head>"
echo " <body>"
echo " Page body."
echo " </body>"
echo "</html>"


Our first attempt at this problem contains a shebang, a comment (always 
a good idea), and a sequence of echo commands, one for each line of output. 
After saving the file, we’ll make it executable and attempt to run it.


[me@linuxbox ~]$ chmod 755 ~/bin/sys_info_page 
[me@linuxbox ~]$ sys_info_page 


When the program runs, we should see the text of the HTML docu- 
ment displayed on the screen, because the echo commands in the script 
send their output to standard output. We’ll run the program again and 
redirect the output of the program to the file sys_info_page.html so that we
can view the result with a web browser. 


[me@linuxbox ~]$ sys_info_page > sys_info_page.html 
[me@linuxbox ~]$ firefox sys_info_page.html 


So far, so good. 

When writing programs, it’s always a good idea to strive for simplicity 
and clarity. Maintenance is easier when a program is easy to read and under- 
stand, not to mention that it can make the program easier to write by reduc- 
ing the amount of typing. Our current version of the program works fine, 
but it could be simpler. We could actually combine all the echo commands 
into one, which will certainly make it easier to add more lines to the pro- 
gram’s output. So, let’s change our program to this: 


#!/bin/bash 
# Program to output a system information page 


echo "<html> 
<head> 
<title>Page Title</title> 
</head> 
<body> 
Page body. 
</body> 
</html>" 


A quoted string may include newlines and, therefore, contain multiple 
lines of text. The shell will keep reading the text until it encounters the 
closing quotation mark. It works this way on the command line, too: 


[me@linuxbox ~]$ echo "<html>
> <head>
> <title>Page Title</title>
> </head>
> <body>
> Page body.
> </body>
> </html>"


The leading > character is the shell prompt contained in the PS2 shell 
variable. It appears whenever we type a multiline statement into the shell. 
This feature is a little obscure right now, but later, when we cover multiline 
programming statements, it will turn out to be quite handy. 


Second Stage: Adding a Little Data 


Now that our program can generate a minimal document, let’s put some 
data in the report. To do this, we will make the following changes: 


#!/bin/bash 
# Program to output a system information page 


echo "<html>
<head>
<title>System Information Report</title>
</head>
<body>
<h1>System Information Report</h1>
</body>
</html>"


We added a page title and a heading to the body of the report. 


Variables and Constants


There is an issue with our script, however. Notice how the string System
Information Report is repeated? With our tiny script it’s not a problem, but 
let’s imagine that our script was really long and we had multiple instances 
of this string. If we wanted to change the title to something else, we would 
have to change it in multiple places, which could be a lot of work. What if 
we could arrange the script so that the string appeared only once and not 
multiple times? That would make future maintenance of the script much 
easier. Here’s how we could do that: 


#!/bin/bash 
# Program to output a system information page 
title="System Information Report" 


echo "<html> 
<head> 
<title>$title</title> 
</head> 
<body> 
<h1>$title</h1> 
</body> 
</html>" 


By creating a variable named title and assigning it the value System 
Information Report, we can take advantage of parameter expansion and place 
the string in multiple locations. 

So, how do we create a variable? Simple, we just assign a value to it. The
shell creates the variable at that moment, and it treats any variable that
has never been assigned as if it were empty. This differs from many programming
languages in which variables must be explicitly declared or defined
before use. The shell is very lax about this, which can lead to some problems.
For example, consider this scenario played out on the command line:


[me@linuxbox ~]$ foo="yes" 
[me@linuxbox ~]$ echo $foo 
yes 

[me@linuxbox ~]$ echo $fool 


[me@linuxbox ~]$ 


We first assign the value yes to the variable foo, and then we display its
value with echo. Next we display the value of the variable name misspelled
as fool and get a blank result. This is because the shell treats the unset
variable fool as though it were empty instead of reporting an error.
From this, we learn that we must pay close attention
to our spelling! It’s also important to understand what really happened
in this example. From our previous look at how the shell performs expansions,
we know that the following command:


[me@linuxbox ~]$ echo $foo 


undergoes parameter expansion and results in the following: 


[me@linuxbox ~]$ echo yes 


By contrast, the following command: 


[me@linuxbox ~]$ echo $fool 


expands into this: 


[me@linuxbox ~]$ echo 


The empty variable expands into nothing! This can play havoc with 
commands that require arguments. Here’s an example: 


[me@linuxbox ~]$ foo=foo.txt 

[me@linuxbox ~]$ foo1=foo1.txt 

[me@linuxbox ~]$ cp $foo $fool 

cp: missing destination file operand after `foo.txt'
Try `cp --help' for more information.


We assign values to two variables, foo and foo1. We then perform a cp 
but misspell the name of the second argument. After expansion, the cp 
command is sent only one argument, though it requires two. 
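
The effect of the misspelling can be made visible by counting the arguments a
command actually receives. The following is only a sketch: count_args is a
hypothetical helper that is not part of our project, and the unset line is
there solely to guarantee that fool has no value.

```shell
#!/bin/bash
# count_args is a hypothetical helper that prints how many arguments it received.
count_args () {
    echo "$#"
}

foo=foo.txt
foo1=foo1.txt
unset fool                            # guarantee the misspelled name has no value

correct=$(count_args $foo $foo1)      # both names are set, so two arguments arrive
misspelled=$(count_args $foo $fool)   # $fool expands to nothing, so only one arrives

echo "correct spelling: $correct"
echo "misspelled name:  $misspelled"
```

A command such as cp sees exactly what count_args sees, which is why it
complains about a missing operand.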

There are some rules about variable names: 


• Variable names may consist of alphanumeric characters (letters and
  numbers) and underscore characters.
• The first character of a variable name must be either a letter or an
  underscore.
• Spaces and punctuation symbols are not allowed.
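
The naming rules above can be expressed as a small check. This is a sketch
under stated assumptions: is_valid_name is a hypothetical helper, not a shell
builtin, and it relies on the case and for constructs, which this chapter has
not introduced.

```shell
#!/bin/bash
# is_valid_name is a hypothetical helper, not a shell builtin.
# It succeeds (exit status 0) only if its argument obeys the naming rules.
is_valid_name () {
    case "$1" in
        "")               return 1 ;;   # an empty string is not a name
        [!A-Za-z_]*)      return 1 ;;   # must start with a letter or underscore
        *[!A-Za-z0-9_]*)  return 1 ;;   # only letters, digits, and underscores
        *)                return 0 ;;
    esac
}

for name in title _tmp file2 2file "my var" total-count; do
    if is_valid_name "$name"; then
        echo "valid:   $name"
    else
        echo "invalid: $name"
    fi
done
```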


The word variable implies a value that changes, and in many applications, 
variables are used this way. However, the variable in our application, title, 
is used as a constant. A constant is just like a variable in that it has a name 
and contains a value. The difference is that the value of a constant does not 
change. In an application that performs geometric calculations, we might 
define PI as a constant and assign it the value of 3.1415, instead of using the
number literally throughout our program. The shell makes no distinction 
between variables and constants; they are mostly for the programmer’s 
convenience. A common convention is to use uppercase letters to designate 
constants and lowercase letters for true variables. We will modify our script 
to comply with this convention: 


#!/bin/bash 
# Program to output a system information page 
TITLE="System Information Report For $HOSTNAME" 


echo "<html> 
<head> 
<title>$TITLE</title> 
</head> 
<body> 
<h1>$TITLE</h1> 
</body> 
</html>" 


We also took the opportunity to jazz up our title by adding the value of 
the shell variable HOSTNAME. This is the network name of the machine. 


The shell actually does provide a way to enforce the immutability of constants, 
through the use of the declare built-in command with the -r (read-only) option. 
Had we assigned TITLE this way: 


declare -r TITLE="Page Title" 


the shell would prevent any subsequent assignment to TITLE. This feature is rarely 
used, but it exists for very formal scripts. 
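
As a brief illustration of the read-only behavior, the sketch below attempts
to change a constant after declaring it. The subshell (the parentheses) and
the status variable are our own additions; they confine the failed assignment
so the rest of the script keeps running.

```shell
#!/bin/bash
# A sketch of a read-only constant. The subshell ( ... ) confines the failed
# assignment so that the rest of the script keeps running.
declare -r TITLE="Page Title"

( TITLE="Another Title" ) 2> /dev/null
status=$?                      # non-zero: the shell refused the assignment

echo "TITLE is still: $TITLE"
echo "exit status of the attempted change: $status"
```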


Assigning Values to Variables and Constants 


Here is where our knowledge of expansion really starts to pay off. As we 
have seen, variables are assigned values this way: 


variable=value 


where variable is the name of the variable and value is a string. Unlike some 
other programming languages, the shell does not care about the type of data 
assigned to a variable; it treats them all as strings. You can force the shell to 
restrict the assignment to integers by using the declare command with the -i 
option, but, like setting variables as read-only, this is rarely done. 

Note that in an assignment, there must be no spaces between the vari- 
able name, the equal sign, and the value. So, what can the value consist of? 
It can have anything that we can expand into a string. 


a=z                     # Assign the string "z" to variable a.
b="a string"            # Embedded spaces must be within quotes.
c="a string and $b"     # Other expansions such as variables can be
                        # expanded into the assignment.
d="$(ls -l foo.txt)"    # Results of a command.
e=$((5 * 7))            # Arithmetic expansion.
f="\t\ta string\n"      # Escape sequences such as tabs and newlines.


Multiple variable assignments may be done on a single line. 


a=5 b="a string" 


During expansion, variable names may be surrounded by optional
braces, {}. This is useful in cases where a variable name becomes ambiguous
because of its surrounding context. Here, we try to change the name
of a file from myfile to myfile1, using a variable:


[me@linuxbox ~]$ filename="myfile" 

[me@linuxbox ~]$ touch "$filename" 

[me@linuxbox ~]$ mv "$filename" "$filename1" 

mv: missing destination file operand after `myfile'
Try `mv --help' for more information.


This attempt fails because the shell interprets the second argument of 
the mv command as a new (and empty) variable. The problem can be over- 
come this way: 


[me@linuxbox ~]$ mv "$filename" "${filename}1" 


By adding the surrounding braces, the shell no longer interprets the 
trailing 1 as part of the variable name. 
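
The same ambiguity arises with any character that is legal in a variable
name, the underscore being a common case. The variable names in this sketch
are illustrative only.

```shell
#!/bin/bash
# The names below are illustrative only.
unset filename_backup               # make sure this longer name has no value
filename="myfile"

wrong="$filename_backup"            # read as the variable filename_backup: empty
right="${filename}_backup"          # braces end the name after "filename"

echo "without braces: '$wrong'"
echo "with braces:    '$right'"
```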


It’s good practice to enclose variables and command substitutions in double quotes
to limit the effects of word-splitting by the shell. Quoting is especially important when 
a variable might contain a filename. 
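
To see what word-splitting does to a filename that contains a space, we can
count the arguments that result with and without quotes. As a sketch only:
count_args is a hypothetical helper and the filename is made up.

```shell
#!/bin/bash
# count_args is a hypothetical helper that prints how many arguments it received.
count_args () {
    echo "$#"
}

filename="my report.txt"            # a made-up filename containing a space

unquoted=$(count_args $filename)    # word-splitting yields two arguments
quoted=$(count_args "$filename")    # the quotes keep it as one argument

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```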


We'll take this opportunity to add some data to our report: namely, the 
date and time the report was created and the username of the creator. 


#!/bin/bash 
# Program to output a system information page 


TITLE="System Information Report For $HOSTNAME"
CURRENT_TIME="$(date +"%x %r %Z")"
TIMESTAMP="Generated $CURRENT_TIME, by $USER"


echo "<html>
<head>
<title>$TITLE</title>
</head>
<body>
<h1>$TITLE</h1>
<p>$TIMESTAMP</p>
</body>
</html>"


Here Documents


We've looked at two different methods of outputting our text, both using the
echo command. There is a third way called a here document or here script. A here 
document is an additional form of I/O redirection in which we embed a body 
of text into our script and feed it into the standard input of a command. It 
works like this: 


command << token 
text 
token 


where command is the name of a command that accepts standard input and
token is a string used to indicate the end of the embedded text. Here we’ll 
modify our script to use a here document: 


#!/bin/bash 
# Program to output a system information page 


TITLE="System Information Report For $HOSTNAME"
CURRENT_TIME="$(date +"%x %r %Z")"
TIMESTAMP="Generated $CURRENT_TIME, by $USER"


cat << _EOF_ 
<html> 
<head> 
<title>$TITLE</title> 
</head> 
<body> 
<h1>$TITLE</h1> 
<p>$TIMESTAMP</p> 
</body> 
</html> 
_EOF_ 


Instead of using echo, our script now uses cat and a here document. The
string _EOF_ (meaning end of file, a common convention) was selected as the
token and marks the end of the embedded text. Note that the token must
appear alone and that there must not be trailing spaces on the line.

So, what’s the advantage of using a here document? It’s mostly the same as 
echo, except that, by default, single and double quotes within here documents 
lose their special meaning to the shell. Here is a command line example: 


[me@linuxbox ~]$ foo="some text"
[me@linuxbox ~]$ cat << _EOF_
> $foo
> "$foo"
> '$foo'
> \$foo
> _EOF_
some text
"some text"
'some text'
$foo


As we can see, the shell pays no attention to the quotation marks. It 
treats them as ordinary characters. This allows us to embed quotes freely 
within a here document. This could turn out to be handy for our report 
program. 
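
For instance, HTML attributes are normally written with double quotes, which
a here document passes through untouched. The following sketch assumes a
hypothetical function make_link and a placeholder URL, neither of which is
part of our report program.

```shell
#!/bin/bash
# make_link is a hypothetical function and the URL is a placeholder.
LINK_URL="http://localhost/reports"

make_link () {
    cat << _EOF_
<a href="$LINK_URL">All reports</a>
_EOF_
}

make_link
```

With echo and a double-quoted string, each of the inner quotes would have to
be escaped with a backslash.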

Here documents can be used with any command that accepts standard 
input. In this example, we use a here document to pass a series of commands 
to the ftp program to retrieve a file from a remote FTP server: 


#!/bin/bash 
# Script to retrieve a file via FTP 


FTP_SERVER=ftp.nl.debian.org
FTP_PATH=/debian/dists/stretch/main/installer-amd64/current/images/cdrom
REMOTE_FILE=debian-cd_info.tar.gz


ftp -n << _EOF_
open $FTP_SERVER
user anonymous me@linuxbox
cd $FTP_PATH
hash
get $REMOTE_FILE
bye
_EOF_
ls -l "$REMOTE_FILE"


If we change the redirection operator from << to <<-, the shell will 
ignore leading tab characters (but not spaces) in the here document. This 
allows a here document to be indented, which can improve readability. 


#!/bin/bash 
# Script to retrieve a file via FTP 


FTP_SERVER=ftp.nl.debian.org 
FTP_PATH=/debian/dists/stretch/main/installer-amd64/current/images/cdrom 
REMOTE_FILE=debian-cd_info.tar.gz 


ftp -n <<- _EOF_
	open $FTP_SERVER
	user anonymous me@linuxbox
	cd $FTP_PATH
	hash
	get $REMOTE_FILE
	bye
	_EOF_


ls -l "$REMOTE_FILE"


This feature can be somewhat problematic because many text editors 
(and programmers themselves) will prefer to use spaces instead of tabs to 
achieve indentation in their scripts. 


Summing Up 


In this chapter, we started a project that will carry us through the process of
building a successful script. We introduced the concept of variables and constants
and how they can be employed. They are the first of many applications
we will find for parameter expansion. We also looked at how to produce output
from our script and various methods for embedding blocks of text.


TOP-DOWN DESIGN


As programs get larger and more complex, they become more difficult to
design, code, and maintain. As with any large project, it is often a good
idea to break large, complex tasks into a series of small, simple tasks.
Let’s imagine we are trying to describe a common, everyday task, going
to the market to buy food, to a person from Mars. We might describe the
overall process as the following series of steps:


1. Get in car.
2. Drive to market.
3. Park car.
4. Enter market.
5. Purchase food.
6. Return to car.
7. Drive home.
8. Park car.
9. Enter house.


However, a person from Mars is likely to need more detail. We could 
further break down the subtask “Park car” into this series of steps: 


1. Find parking space.
2. Drive car into space.
3. Turn off motor.
4. Set parking brake.
5. Exit car.
6. Lock car.


The “Turn off motor” subtask could further be broken down into steps 
including “Turn off ignition,” “Remove ignition key,” and so on, until every 
step of the entire process of going to the market has been fully defined. 

This process of identifying the top-level steps and developing increasingly 
detailed views of those steps is called top-down design. This technique allows us 
to break large complex tasks into many small, simple tasks. Top-down design 
is a common method of designing programs and one that is well suited to
shell programming in particular. 

In this chapter, we will use top-down design to further develop our 
report-generator script.


Our script currently performs the following steps to generate the HTML
document:


1. Open page.
2. Open page header.
3. Set page title.
4. Close page header.
5. Open page body.
6. Output page heading.
7. Output timestamp.
8. Close page body.
9. Close page.


For our next stage of development, we will add some tasks between 
steps 7 and 8. These will include the following: 


• System uptime and load. This is the amount of time since the last shutdown
  or reboot and the average number of tasks currently running on
  the processor over several time intervals.
• Disk space. This is the overall use of space on the system’s storage
  devices.
• Home space. This is the amount of storage space being used by each user.


If we had a command for each of these tasks, we could add them to our 
script simply through command substitution. 


#!/bin/bash 
# Program to output a system information page 
TITLE="System Information Report For $HOSTNAME"
CURRENT_TIME="$(date +"%x %r %Z")"
TIMESTAMP="Generated $CURRENT_TIME, by $USER"


cat << _EOF_ 
<html> 
<head> 
<title>$TITLE</title> 
</head> 
<body> 
<h1>$TITLE</h1>
<p>$TIMESTAMP</p> 
$(report_uptime) 
$(report_disk_space) 
$(report_home_space) 
</body> 
</html> 
_EOF_ 


We could create these additional commands in two ways. We could 
write three separate scripts and place them in a directory listed in our PATH, 
or we could embed the scripts within our program as shell functions. As we 
have mentioned, shell functions are “mini-scripts” that are located inside 
other scripts and can act as autonomous programs. Shell functions have 
two syntactic forms. First, here is the more formal form: 


function name {
    commands
    return
}


Here is the simpler (and generally preferred) form:


name () {
    commands
    return
}


where name is the name of the function and commands is a series of commands 
contained within the function. Both forms are equivalent and may be used 
interchangeably. The following is a script that demonstrates the use of a 
shell function: 


 1 #!/bin/bash
 2
 3 # Shell function demo
 4
 5 function step2 {
 6     echo "Step 2"
 7     return
 8 }
 9
10 # Main program starts here
11
12 echo "Step 1"
13 step2
14 echo "Step 3"


As the shell reads the script, it passes over lines 1 through 11 because 
those lines consist of comments and the function definition. Execution 
begins at line 12, with an echo command. Line 13 calls the shell function 
step2, and the shell executes the function just as it would any other com- 
mand. Program control then moves to line 6, and the second echo command 
is executed. Line 7 is executed next. Its return command terminates the 
function and returns control to the program at the line following the func- 
tion call (line 14), and the final echo command is executed. Note that for 
function calls to be recognized as shell functions and not interpreted as the 
names of external programs, shell function definitions must appear in the 
script before they are called. 
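
The ordering requirement can be observed directly. In this sketch, demo_step
is a hypothetical function name, and the first call is wrapped in a subshell
(the parentheses) with its error message discarded so that the script can
carry on and record the result.

```shell
#!/bin/bash
# demo_step is a hypothetical function name chosen for this illustration.

( demo_step ) 2> /dev/null          # called before it is defined
before=$?                           # non-zero: the shell found no such command

demo_step () {
    echo "demo_step ran"
    return
}

demo_step                           # called after the definition
after=$?

echo "status before definition: $before"
echo "status after definition:  $after"
```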

We'll add minimal shell function definitions to our script, shown here: 


#!/bin/bash 
# Program to output a system information page 


TITLE="System Information Report For $HOSTNAME"
CURRENT_TIME="$(date +"%x %r %Z")"
TIMESTAMP="Generated $CURRENT_TIME, by $USER"


report_uptime () {
    return
}


report_disk_space () {
    return
}


report_home_space () {
    return
}


cat << _EOF_
<html>
<head>
<title>$TITLE</title>
</head>
<body>
<h1>$TITLE</h1>
<p>$TIMESTAMP</p>
$(report_uptime)
$(report_disk_space)
$(report_home_space)
</body>
</html>
_EOF_


Shell function names follow the same rules as variables. A function must 
contain at least one command. The return command (which is optional) sat- 
isfies the requirement. 


Local Variables 


In the scripts we have written so far, all the variables (including constants) 
have been global variables. Global variables maintain their existence through- 
out the program. This is fine for many things, but it can sometimes compli- 
cate the use of shell functions. Inside shell functions, it is often desirable to 
have local variables. Local variables are accessible only within the shell func- 
tion in which they are defined and cease to exist once the shell function 
terminates. 

Having local variables allows the programmer to use variables with 
names that may already exist, either in the script globally or in other shell 
functions, without having to worry about potential name conflicts. 

Here is an example script that demonstrates how local variables are 
defined and used: 


#!/bin/bash

# local-vars: script to demonstrate local variables

foo=0    # global variable foo

funct_1 () {
    local foo    # variable foo local to funct_1
    foo=1
    echo "funct_1: foo = $foo"
}

funct_2 () {
    local foo    # variable foo local to funct_2
    foo=2
    echo "funct_2: foo = $foo"
}

echo "global: foo = $foo"
funct_1
echo "global: foo = $foo"
funct_2
echo "global: foo = $foo"


As we can see, local variables are defined by preceding the variable 
name with the word local. This creates a variable that is local to the shell 
function in which it is defined. Once outside the shell function, the variable 
no longer exists. When we run this script, we see these results: 


[me@linuxbox ~]$ local-vars
global: foo = 0
funct_1: foo = 1
global: foo = 0
funct_2: foo = 2
global: foo = 0


We see that the assignment of values to the local variable foo within both
shell functions has no effect on the value of foo defined outside the functions. 

This feature allows shell functions to be written so that they remain inde- 
pendent of each other and of the script in which they appear. This is valuable 
because it helps prevent one part of a program from interfering with another. 
It also allows shell functions to be written so that they can be portable. That 
is, they may be cut and pasted from script to script, as needed. 
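
For contrast, it may help to see what happens when the word local is left
out. The function and variable names in this sketch are hypothetical.

```shell
#!/bin/bash
# Hypothetical functions contrasting an assignment with and without local.
count=10                            # global variable

set_with_local () {
    local count                     # this count exists only inside the function
    count=99
}

set_without_local () {
    count=99                        # no local: this changes the global count
}

set_with_local
after_local=$count                  # the global is untouched

set_without_local
after_global=$count                 # the global has been overwritten

echo "after set_with_local:    count = $after_local"
echo "after set_without_local: count = $after_global"
```

A function pasted into another script behaves like set_without_local whenever
it assigns to a name the new script happens to use, which is the interference
that local prevents.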


Keep Scripts Running 


While developing our program, it is useful to keep the program in a 
runnable state. By doing this, and testing frequently, we can detect errors 
early in the development process. This will make debugging problems 
much easier. For example, if we run the program, make a small change, 
then run the program again and find a problem, it’s likely that the most 
recent change is the source of the problem. By adding the empty func- 
tions, called stubs in programmer-speak, we can verify the logical flow of 
our program at an early stage. When constructing a stub, it’s a good idea 
to include something that provides feedback to the programmer, which
shows the logical flow is being carried out. If we look at the output of our
script now:


[me@linuxbox ~]$ sys_info_page 
<html>
<head>
<title>System Information Report For linuxbox</title>
</head>
<body>
<h1>System Information Report For linuxbox</h1>
<p>Generated 03/19/2018 04:02:10 PM EDT, by me</p>



</body>
</html>


we see that there are some blank lines in our output after the timestamp,
but we can’t be sure of the cause. If we change the functions to include
some feedback:


report_uptime () {
    echo "Function report_uptime executed."
    return
}


report_disk_space () {
    echo "Function report_disk_space executed."
    return
}


report_home_space () {
    echo "Function report_home_space executed."
    return
}

and run the script again: 


[me@linuxbox ~]$ sys_info_page 
<html>
<head>
<title>System Information Report For linuxbox</title>
</head>
<body>
<h1>System Information Report For linuxbox</h1>
<p>Generated 03/20/2018 05:17:26 AM EDT, by me</p>
Function report_uptime executed.
Function report_disk_space executed.
Function report_home_space executed.
</body>
</html>


we now see that, in fact, our three functions are being executed.


With our function framework in place and working, it’s time to flesh
out some of the function code. First, here’s the report_uptime function:


report_uptime () {
	cat <<- _EOF_
		<h2>System Uptime</h2>
		<pre>$(uptime)</pre>
		_EOF_
	return
}


It’s pretty straightforward. We use a here document to output a section 
header and the output of the uptime command, surrounded by <pre> tags to 
preserve the formatting of the command. The report_disk_space function is 
similar. 


report_disk_space () {
	cat <<- _EOF_
		<h2>Disk Space Utilization</h2>
		<pre>$(df -h)</pre>
		_EOF_
	return
}


This function uses the df -h command to determine the amount of disk 
space. Lastly, we'll build the report_home_space function. 


report_home_space () {
	cat <<- _EOF_
		<h2>Home Space Utilization</h2>
		<pre>$(du -sh /home/*)</pre>
		_EOF_
	return
}


We use the du command with the -sh options to perform this task. This, 
however, is not a complete solution to the problem. While it will work on 
some systems (Ubuntu, for example), it will not work on others. The reason 
is that many systems set the permissions of home directories to prevent them 
from being world-readable, which is a reasonable security measure. On these 
systems, the report_home_space function, as written, will work only if our script 
is run with superuser privileges. A better solution would be to have the script 
adjust its behavior according to the privileges of the user. We will take this up 
in the next chapter. 


SHELL FUNCTIONS IN YOUR .BASHRC FILE 


Shell functions make excellent replacements for aliases and are actually the
preferred method of creating small commands for personal use. Aliases are
limited in the kind of commands and shell features they support, whereas shell
functions allow anything that can be scripted. For example, if we liked the
report_disk_space shell function that we developed for our script, we could
create a similar function named ds for our .bashrc file:


ds () {
    echo "Disk Space Utilization For $HOSTNAME"
    df -h
}


Summing Up 


In this chapter, we introduced a common method of program design 
called top-down design, and we saw how shell functions are used to build 
the stepwise refinement that it requires. We also saw how local variables 
can be used to make shell functions independent from one another and 
from the program in which they are placed. This makes it possible for shell 
functions to be written in a portable manner and to be reusable by allowing 
them to be placed in multiple programs; this is a great time saver.


FLOW CONTROL:
BRANCHING WITH IF


In the previous chapter, we were presented with a problem. How can we make
our report-generator script adapt to the privileges of the user running the
script? The solution to this problem will require us to find a way to
“change directions” within our script, based on the results of a test. In
programming terms, we need the program to branch.


Let’s consider a simple example of logic expressed in pseudocode, a simulation
of a computer language intended for human consumption.


X = 5
If X = 5, then:
    Say “X equals 5.”
Otherwise:
    Say “X is not equal to 5.”


This is an example of a branch. Based on the condition “Does X = 5?” 
do one thing, “Say X equals 5,” and otherwise do another thing, “Say X is 
not equal to 5.” 


if Statements 


Using the shell, we can code the previous logic as follows: 


x=5

if [ "$x" -eq 5 ]; then
    echo "x equals 5."
else
    echo "x does not equal 5."
fi


Or we can enter it directly at the command line (slightly shortened). 


[me@linuxbox ~]$ x=5
[me@linuxbox ~]$ if [ "$x" -eq 5 ]; then echo "equals 5"; else echo "does not equal 5"; fi
equals 5
[me@linuxbox ~]$ x=0
[me@linuxbox ~]$ if [ "$x" -eq 5 ]; then echo "equals 5"; else echo "does not equal 5"; fi
does not equal 5


In this example, we execute the command twice: once, with the value 
of x set to 5, which results in the string “equals 5” being output, and the 
second time with the value of x set to 0, which results in the string “does 
not equal 5” being output. 

The if statement has the following syntax: 


if commands; then
    commands
[elif commands; then
    commands...]
[else
    commands]
fi


where commands is a list of commands. This is a little confusing at first glance. 
But before we can clear this up, we have to look at how the shell evaluates 
the success or failure of a command. 


Exit Status 


Commands (including the scripts and shell functions we write) issue a value 
to the system when they terminate, called an exit status. This value, which 
is an integer in the range of 0 to 255, indicates the success or failure of the
command’s execution. By convention, a value of zero indicates success and
any other value indicates failure. The shell provides a parameter that we 
can use to examine the exit status. Here we see it in action: 


[me@linuxbox ~]$ ls -d /usr/bin
/usr/bin
[me@linuxbox ~]$ echo $?
0
[me@linuxbox ~]$ ls -d /bin/usr
ls: cannot access /bin/usr: No such file or directory
[me@linuxbox ~]$ echo $?
2


In this example, we execute the ls command twice. The first time, the
command executes successfully. If we display the value of the parameter $?,
we see that it is zero. We execute the ls command a second time (specifying
a nonexistent directory), producing an error, and examine the parameter 
$? again. This time it contains a 2, indicating that the command encoun- 
tered an error. Some commands use different exit status values to provide 
diagnostics for errors, while many commands simply exit with a value of 
1 when they fail. Man pages often include a section entitled “Exit Status,” 
describing what codes are used. However, a zero always indicates success. 
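
Because $? always describes the most recent command, a script that needs the
value later should copy it into a variable immediately. In this sketch, the
variable names and the nonexistent path are our own choices.

```shell
#!/bin/bash
# $? always describes the most recent command, so save it right away
# if it will be needed later.
ls -d / > /dev/null 2>&1
ok_status=$?                        # saved before any other command runs

ls -d /no/such/directory > /dev/null 2>&1
failed_status=$?

echo "existing directory: $ok_status"
echo "missing directory:  $failed_status"
```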

The shell provides two extremely simple builtin commands that do 
nothing except terminate with either a 0 or 1 exit status. The true com- 
mand always executes successfully, and the false command always exe- 
cutes unsuccessfully. 


[me@linuxbox ~]$ true
[me@linuxbox ~]$ echo $?
0
[me@linuxbox ~]$ false
[me@linuxbox ~]$ echo $?
1


We can use these commands to see how the if statement works. What 
the if statement really does is evaluate the success or failure of commands. 


[me@linuxbox ~]$ if true; then echo "It's true."; fi 
It's true. 

[me@linuxbox ~]$ if false; then echo "It's true."; fi 
[me@linuxbox ~]$ 


The command echo "It's true." is executed when the command following
if executes successfully and is not executed when the command following
if does not execute successfully. If a list of commands follows if, every
command in the list runs, but only the exit status of the last one is evaluated.


[me@linuxbox ~]$ if false; true; then echo "It's true."; fi 
It's true. 

[me@linuxbox ~]$ if true; false; then echo "It's true."; fi 
[me@linuxbox ~]$


By far, the command used most frequently with if is test. The test command
performs a variety of checks and comparisons. It has two equivalent
forms. The first, shown here: 


test expression 


And the second, more popular form, here: 


[ expression ] 


where expression is an expression that is evaluated as either true or false. 
The test command returns an exit status of 0 when the expression is true 
and a status of 1 when the expression is false. 
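
We can confirm that the two forms behave identically by examining their exit
statuses. This sketch uses the -d expression (true if the file is a
directory); the variable names and the nonexistent path are assumptions made
for the illustration.

```shell
#!/bin/bash
# Both forms evaluate the same expression and set the same exit status.
test -d /
test_status=$?

[ -d / ]
bracket_status=$?

[ -d /no/such/directory ]
missing_status=$?

echo "test -d /: $test_status"
echo "[ -d / ]:  $bracket_status"
echo "[ -d /no/such/directory ]: $missing_status"
```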

It is interesting to note that both test and [ are actually commands. In 
bash they are builtins, but they also exist as programs in /usr/bin for use with 
other shells. The expression is actually just its arguments with the [ com- 
mand requiring that the ] character be provided as its final argument. 

The test and [ commands support a wide range of useful expressions 
and tests. 


File Expressions 


Table 27-1 lists the expressions used to evaluate the status of files. 


Table 27-1: test File Expressions 


Expression        Is true if:

file1 -ef file2   file1 and file2 have the same inode numbers (the two filenames
                  refer to the same file by hard linking).
file1 -nt file2   file1 is newer than file2.
file1 -ot file2   file1 is older than file2.
-b file           file exists and is a block-special (device) file.
-c file           file exists and is a character-special (device) file.
-d file           file exists and is a directory.
-e file           file exists.
-f file           file exists and is a regular file.
-g file           file exists and is set-group-ID.
-G file           file exists and is owned by the effective group ID.
-k file           file exists and has its “sticky bit” set.
-L file           file exists and is a symbolic link.
-O file           file exists and is owned by the effective user ID.
-p file           file exists and is a named pipe.
-r file           file exists and is readable (has readable permission for the
                  effective user).
-s file           file exists and has a length greater than zero.
-S file           file exists and is a network socket.
-t fd             fd is a file descriptor directed to/from the terminal. This can be
                  used to determine whether standard input/output/error is being
                  redirected.
-u file           file exists and is setuid.
-w file           file exists and is writable (has write permission for the effective
                  user).
-x file           file exists and is executable (has execute/search permission for
                  the effective user).
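
The -t expression is the least obvious entry in the table, so here is a small
sketch of it. stdout_kind is a hypothetical helper; file descriptor 1 is
standard output.

```shell
#!/bin/bash
# stdout_kind is a hypothetical helper that reports where standard output goes.
stdout_kind () {
    if [ -t 1 ]; then
        echo "terminal"
    else
        echo "redirected"
    fi
}

stdout_kind                         # result depends on how the script is run
captured=$(stdout_kind)             # inside $( ), standard output is a pipe
echo "inside command substitution: $captured"
```

A script could use such a check to decide, for example, whether to print
progress messages meant only for a person watching the terminal.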


Here we have a script that demonstrates some of the file expressions: 


#!/bin/bash

# test-file: Evaluate the status of a file

FILE=~/.bashrc

if [ -e "$FILE" ]; then
    if [ -f "$FILE" ]; then
        echo "$FILE is a regular file."
    fi
    if [ -d "$FILE" ]; then
        echo "$FILE is a directory."
    fi
    if [ -r "$FILE" ]; then
        echo "$FILE is readable."
    fi
    if [ -w "$FILE" ]; then
        echo "$FILE is writable."
    fi
    if [ -x "$FILE" ]; then
        echo "$FILE is executable/searchable."
    fi
else
    echo "$FILE does not exist"
    exit 1
fi

exit


The script evaluates the file assigned to the constant FILE and displays 
its results as the evaluation is performed. There are two interesting things 
to note about this script. First, notice how the parameter $FILE is quoted 
within the expressions. This is not required to syntactically complete the 
expression; rather, it is a defense against the parameter being empty or 
containing only whitespace. If the parameter expansion of $FILE were to 
result in an empty value, it would cause an error (the operators would be
interpreted as non-null strings rather than operators). Using the quotes
around the parameter ensures that the operator is always followed by
a string, even if the string is empty. Second, notice the presence of the
exit command near the end of the script. The exit command accepts a 
single, optional argument, which becomes the script’s exit status. When 
no argument is passed, the exit status defaults to the exit status of the last 
command executed. Using exit in this way allows the script to indicate 
failure if $FILE expands to the name of a nonexistent file. The exit com- 
mand appearing on the last line of the script is there as a formality. When 
a script “runs off the end” (reaches end of file), it terminates with an exit 
status of the last command executed. 
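
To see what the quotes protect against, here is a minimal sketch. The script name quote-demo and the variable EMPTY are made up for this illustration. With an empty, unquoted parameter, test receives only the operator and treats it as a non-null string, so the expression succeeds when we would expect it to fail.

```shell
#!/bin/bash
# quote-demo: show why "$VAR" is quoted inside [ ]
# EMPTY is a hypothetical variable used only for this illustration.
EMPTY=

# Unquoted: the expansion vanishes, leaving [ -f ], which only checks
# that the string "-f" is non-null. That is always true.
if [ -f $EMPTY ]; then
    unquoted_result="true"
else
    unquoted_result="false"
fi

# Quoted: test sees [ -f "" ] and reports that no such file exists.
if [ -f "$EMPTY" ]; then
    quoted_result="true"
else
    quoted_result="false"
fi

echo "unquoted: $unquoted_result"
echo "quoted:   $quoted_result"
```

Run as written, the unquoted test reports true and the quoted test reports false.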

Similarly, shell functions can return an exit status by including an integer 
argument to the return command. If we were to convert the previous script to 
a shell function to include it in a larger program, we could replace the exit 
commands with return statements and get the desired behavior. 


test_file () {
    # test-file: Evaluate the status of a file
    FILE=~/.bashrc

    if [ -e "$FILE" ]; then
        if [ -f "$FILE" ]; then
            echo "$FILE is a regular file."
        fi
        if [ -d "$FILE" ]; then
            echo "$FILE is a directory."
        fi
        if [ -r "$FILE" ]; then
            echo "$FILE is readable."
        fi
        if [ -w "$FILE" ]; then
            echo "$FILE is writable."
        fi
        if [ -x "$FILE" ]; then
            echo "$FILE is executable/searchable."
        fi
    else
        echo "$FILE does not exist"
        return 1
    fi
}

String Expressions 


Table 27-2 lists the expressions used to evaluate strings. 


Table 27-2: test String Expressions 


Expression Is true if: 

string string is not null. 

-n string The length of string is greater than zero. 

-z string The length of string is zero. 

string1 = string2 string1 and string2 are equal. Single or double equal signs
string1 == string2 may be used. The use of double equal signs is greatly
preferred, but it is not POSIX compliant.


string1 != string2 string1 and string2 are not equal. 
string1 > string2 string1 sorts after string2. 
string1 < string2 string1 sorts before string2. 


The > and < expression operators must be quoted (or escaped with a backslash) when 
used with test. If they are not, they will be interpreted by the shell as redirection 
operators, with potentially destructive results. Also note that while the bash docu- 
mentation states that the sorting order conforms to the collation order of the current 
locale, it does not. ASCII (POSIX) order is used in versions of bash up to and 
including 4.0. This problem was fixed in version 4.1. 
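
As a minimal sketch of the escaping described in this note, the following compares two arbitrary example strings with test. The script name string-order and the variable names are assumptions made for this illustration. Without the backslash, the shell would treat < as input redirection from a file named banana rather than as a comparison.

```shell
#!/bin/bash
# string-order: compare two strings with test
# "apple" and "banana" are arbitrary example values.
A=apple
B=banana

# The backslash keeps the shell from treating < as a redirection operator.
if [ "$A" \< "$B" ]; then
    order="$A sorts before $B"
else
    order="$A does not sort before $B"
fi

echo "$order"
```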


Here is a script that incorporates string expressions: 


#!/bin/bash
# test-string: evaluate the value of a string
ANSWER=maybe

if [ -z "$ANSWER" ]; then
    echo "There is no answer." >&2
    exit 1
fi

if [ "$ANSWER" = "yes" ]; then
    echo "The answer is YES."
elif [ "$ANSWER" = "no" ]; then
    echo "The answer is NO."
elif [ "$ANSWER" = "maybe" ]; then
    echo "The answer is MAYBE."
else
    echo "The answer is UNKNOWN."
fi


In this script, we evaluate the constant ANSWER. We first determine whether 
the string is empty. If it is, we terminate the script and set the exit status to 1. 
Notice the redirection that is applied to the echo command. This redirects the 
error message “There is no answer.” to standard error, which is the proper 
thing to do with error messages. If the string is not empty, we evaluate the 
value of the string to see whether it is equal to either “yes,” “no,” or “maybe.”
We do this by using elif, which is short for “else if.” By using elif, we are able 
to construct a more complex logical test. 


Integer Expressions 


To compare values as integers rather than as strings, we can use the expres- 
sions listed in Table 27-3. 


Table 27-3: test Integer Expressions 


Expression Is true if: 

integer1 -eq integer2 integer1 is equal to integer2. 

integer1 -ne integer2 integer1 is not equal to integer2.

integer1 -le integer2 integer1 is less than or equal to integer2.
integer1 -lt integer2 integer1 is less than integer2.

integer1 -ge integer2 integer1 is greater than or equal to integer2. 
integer1 -gt integer2 integer1 is greater than integer2. 


Here is a script that demonstrates them: 


#!/bin/bash
# test-integer: evaluate the value of an integer.
INT=-5

if [ -z "$INT" ]; then
    echo "INT is empty." >&2
    exit 1
fi

if [ "$INT" -eq 0 ]; then
    echo "INT is zero."
else
    if [ "$INT" -lt 0 ]; then
        echo "INT is negative."
    else
        echo "INT is positive."
    fi
    if [ $((INT % 2)) -eq 0 ]; then
        echo "INT is even."
    else
        echo "INT is odd."
    fi
fi


The interesting part of the script is how it determines whether an integer 
is even or odd. By performing a modulo 2 operation on the number, which 
divides the number by 2 and returns the remainder, it can tell whether the 
number is odd or even. 
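
One detail worth checking is how the remainder behaves for negative numbers. The following sketch (the script and variable names are ours, chosen for illustration) shows that bash reports a remainder of -1 for -5 % 2. This is why comparing the result with 0, as the script does, is safer than comparing it with 1.

```shell
#!/bin/bash
# modulo-demo: remainders produced by the % operator
# The variable names below are hypothetical and used only here.
rem_even=$((6 % 2))     # even number: remainder is 0
rem_odd=$((7 % 2))      # positive odd number: remainder is 1
rem_neg=$((-5 % 2))     # negative odd number: remainder is -1, not 1

echo " 6 % 2 = $rem_even"
echo " 7 % 2 = $rem_odd"
echo "-5 % 2 = $rem_neg"
```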
A More Modern Version of test


Modern versions of bash include a compound command that acts as an 
enhanced replacement for test. It uses the following syntax: 


[[ expression ]] 


where, like test, expression is an expression that evaluates to either a true 
or false result. The [[ ]] command is similar to test (it supports all of its 
expressions) but adds an important new string expression. 


string1 =~ regex 


This returns true if string1 is matched by the extended regular 
expression regex. This opens up a lot of possibilities for performing such 
tasks as data validation. In our earlier example of the integer expressions, 
the script would fail if the constant INT contained anything except an inte- 
ger. The script needs a way to verify that the constant contains an integer. 
Using [[ ]] with the =~ string expression operator, we could improve the 
script this way: 


#!/bin/bash
# test-integer2: evaluate the value of an integer.
INT=-5

if [[ "$INT" =~ ^-?[0-9]+$ ]]; then
    if [ "$INT" -eq 0 ]; then
        echo "INT is zero."
    else
        if [ "$INT" -lt 0 ]; then
            echo "INT is negative."
        else
            echo "INT is positive."
        fi
        if [ $((INT % 2)) -eq 0 ]; then
            echo "INT is even."
        else
            echo "INT is odd."
        fi
    fi
else
    echo "INT is not an integer." >&2
    exit 1
fi


By applying the regular expression, we are able to limit the value of 
INT to only strings that begin with an optional minus sign, followed by one 
or more numerals. This expression also eliminates the possibility of empty 
values. 
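
To try the expression against several inputs, we might wrap it in a small function. This is only a sketch: the name is_integer and the sample values are assumptions made for this example. The pattern anchors the match at both ends (^ and $), so strings with trailing characters are rejected along with empty values.

```shell
#!/bin/bash
# is_integer: hypothetical helper that wraps the regular expression test.
# Returns success (0) if its first argument is an optionally signed integer.
is_integer () {
    [[ "$1" =~ ^-?[0-9]+$ ]]
}

# Sample values chosen for illustration, including an empty string.
for value in -5 42 "" 4.2 12abc; do
    if is_integer "$value"; then
        echo "'$value' is an integer."
    else
        echo "'$value' is not an integer."
    fi
done
```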
Another added feature of [[ ]] is that the == operator supports pattern
matching the same way pathname expansion does. Here’s an example: 


[me@linuxbox ~]$ FILE=foo.bar
[me@linuxbox ~]$ if [[ $FILE == foo.* ]]; then
> echo "$FILE matches pattern 'foo.*'"
> fi
foo.bar matches pattern 'foo.*'


This makes [[ ]] useful for evaluating file and pathnames.
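
Pattern matching applies only when the right-hand side is unquoted. The sketch below contrasts the two cases; the variable names pattern_match and literal_match are made up for this example.

```shell
#!/bin/bash
# pattern-demo: quoted versus unquoted right-hand side of == in [[ ]]
FILE=foo.bar

# Unquoted: foo.* is treated as a pattern, as in pathname expansion.
if [[ $FILE == foo.* ]]; then
    pattern_match=yes
else
    pattern_match=no
fi

# Quoted: "foo.*" is compared literally, asterisk included.
if [[ $FILE == "foo.*" ]]; then
    literal_match=yes
else
    literal_match=no
fi

echo "pattern match: $pattern_match"
echo "literal match: $literal_match"
```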


(( ))—Designed for Integers 
In addition to the [[ ]] compound command, bash also provides the (( ))
compound command, which is useful for operating on integers. It supports a
full set of arithmetic evaluations, a subject we will cover fully in Chapter 34.

(( )) is used to perform arithmetic truth tests. An arithmetic truth test
results in true if the result of the arithmetic evaluation is non-zero.


[me@linuxbox ~]$ if ((1)); then echo "It is true."; fi 
It is true. 

[me@linuxbox ~]$ if ((0)); then echo "It is true."; fi 
[me@linuxbox ~]$ 


Using (( )), we can slightly simplify the test-integer2 script like this: 


#!/bin/bash
# test-integer2a: evaluate the value of an integer.
INT=-5

if [[ "$INT" =~ ^-?[0-9]+$ ]]; then
    if ((INT == 0)); then
        echo "INT is zero."
    else
        if ((INT < 0)); then
            echo "INT is negative."
        else
            echo "INT is positive."
        fi
        if (( ((INT % 2)) == 0)); then
            echo "INT is even."
        else
            echo "INT is odd."
        fi
    fi
else
    echo "INT is not an integer." >&2
    exit 1
fi


Notice that we use less-than and greater-than signs and that == is used 
to test for equivalence. This is a more natural-looking syntax for working 
with integers. Notice too, that because the compound command (( )) is 
part of the shell syntax rather than an ordinary command and it deals 
only with integers, it is able to recognize variables by name and does not 
require expansion to be performed. We’ll discuss (( )) and the related
arithmetic expansion further in Chapter 34. 


Combining Expressions 


It’s also possible to combine expressions to create more complex evalua- 
tions. Expressions are combined by using logical operators. We saw these 
in Chapter 17 when we learned about the find command. There are three 
logical operations for test and [[ ]]. They are AND, OR, and NOT. test 
and [[ ]] use different operators to represent these operations, as shown 
in Table 27-4. 


Table 27-4: Logical Operators 


Operation test [[ ]] and (( ))
AND -a &&
OR -o ||
NOT ! !


Here’s an example of an AND operation. The following script deter- 
mines whether an integer is within a range of values: 


#!/bin/bash
# test-integer3: determine if an integer is within a
# specified range of values.

MIN_VAL=1
MAX_VAL=100

INT=50

if [[ "$INT" =~ ^-?[0-9]+$ ]]; then
    if [[ "$INT" -ge "$MIN_VAL" && "$INT" -le "$MAX_VAL" ]]; then
        echo "$INT is within $MIN_VAL to $MAX_VAL."
    else
        echo "$INT is out of range."
    fi
else
    echo "INT is not an integer." >&2
    exit 1
fi
In this script, we determine whether the value of integer INT lies between
the values of MIN_VAL and MAX_VAL. This is performed by a single use of [[ ]],
which includes two expressions separated by the && operator. We also could
have coded this using test:


if [ "$INT" -ge "$MIN_VAL" -a "$INT" -le "$MAX_VAL" ]; then
    echo "$INT is within $MIN_VAL to $MAX_VAL."
else
    echo "$INT is out of range."
fi


The ! negation operator reverses the outcome of an expression. It 
returns true if an expression is false, and it returns false if an expression 
is true. In the following script, we modify the logic of our evaluation to 
find values of INT that are outside the specified range: 


#!/bin/bash
# test-integer4: determine if an integer is outside a
# specified range of values.

MIN_VAL=1
MAX_VAL=100

INT=50

if [[ "$INT" =~ ^-?[0-9]+$ ]]; then
    if [[ ! ("$INT" -ge "$MIN_VAL" && "$INT" -le "$MAX_VAL") ]]; then
        echo "$INT is outside $MIN_VAL to $MAX_VAL."
    else
        echo "$INT is in range."
    fi
else
    echo "INT is not an integer." >&2
    exit 1
fi


We also include parentheses around the expression, for grouping. If 
these were not included, the negation would only apply to the first expres- 
sion and not the combination of the two. Coding this with test would be 
done this way: 


if [ ! \( "$INT" -ge "$MIN_VAL" -a "$INT" -le "$MAX_VAL" \) ]; then
    echo "$INT is outside $MIN_VAL to $MAX_VAL."
else
    echo "$INT is in range."
fi


Since all expressions and operators used by test are treated as command 
arguments by the shell (unlike [[ ]] and (( )) ), characters that have special 
meaning to bash, such as <, >, (, and ), must be quoted or escaped. 
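
Since Table 27-4 lists the same logical operators for (( )), the range test could also be sketched with arithmetic evaluation, where no quoting or escaping is needed. The function name in_range and its three positional arguments are assumptions made for this example, and the sketch presumes its arguments have already been validated as integers.

```shell
#!/bin/bash
# in_range: hypothetical helper; succeeds if $1 lies within $2..$3 inclusive.
# Assumes all three arguments have already been checked to be integers.
in_range () {
    local int=$1 min=$2 max=$3
    (( int >= min && int <= max ))
}

if in_range 50 1 100; then
    echo "50 is within 1 to 100."
fi

if ! in_range 101 1 100; then
    echo "101 is outside 1 to 100."
fi
```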
Seeing that test and [[ ]] do roughly the same thing, which is preferable?
test is traditional (and part of the POSIX specification for standard
shells, which are often used for system startup scripts), whereas [[ ]] is
specific to bash (and a few other modern shells). It’s important to know how to
use test since it is widely used, but [[ ]] is clearly more useful and is easier
to code, so it is preferred for modern scripts.


PORTABILITY IS THE HOBGOBLIN OF LITTLE MINDS 


If you talk to “real” Unix people, you quickly discover that many of them don’t 
like Linux very much. They regard it as impure and unclean. One tenet of Unix 
users is that everything should be “portable.” This means that any script you 
write should be able to run, unchanged, on any Unix-like system. 

Unix people have good reason to believe this. Having seen what proprietary
extensions to commands and shells did to the Unix world before POSIX,
they are naturally wary of the effect of Linux on their beloved OS.

But portability has a serious downside. It prevents progress. It requires that 
things are always done using “lowest common denominator” techniques. In the 
case of shell programming, it means making everything compatible with sh, the 
original Bourne shell. 

This downside is the excuse that proprietary software vendors use to justify 
their proprietary extensions, only they call them “innovations.” But they are 
really just lock-in devices for their customers. 

The GNU tools, such as bash, have no such restrictions. They encourage 
portability by supporting standards and by being universally available. You 
can install bash and the other GNU tools on almost any kind of system, even 
Windows, without cost. So feel free to use all the features of bash. It’s really 
portable. 


Control Operators: Another Way to Branch 


bash provides two control operators that can perform branching. The && 
(AND) and || (OR) operators work like the logical operators in the [[ ]] 
compound command. Here is the syntax for &&: 


command1 && command2 


Here is the syntax for ||:


command1 || command2 


It is important to understand the behavior of these. With the && operator, 
command1 is executed, and command2 is executed if, and only if, command1 is success- 
ful. With the || operator, command1 is executed and command2 is executed if, and 
only if, command1 is unsuccessful. 
In practical terms, it means that we can do something like this:


[me@linuxbox ~]$ mkdir temp && cd temp 


This will create a directory named temp, and if it succeeds, the current 
working directory will be changed to temp. The second command is attempted 
only if the mkdir command is successful. Likewise, a command like this: 


[me@linuxbox ~]$ [[ -d temp ]] || mkdir temp 


will test for the existence of the directory temp, and only if the test fails will 
the directory be created. This type of construct is handy for handling errors 
in scripts, a subject we will discuss more in later chapters. For example, we 
could do this in a script: 


[ -d temp ] || exit 1 


If the script requires the directory temp and it does not exist, then the 
script will terminate with an exit status of 1. 
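
The following sketch exercises both operators in a scratch directory so that nothing in the current directory is touched. The script name control-ops and the variables workdir and status are assumptions made for this illustration; mktemp -d is used only to obtain a safe place to work.

```shell
#!/bin/bash
# control-ops: branch with && and ||
# workdir is a scratch directory created just for this illustration.
workdir=$(mktemp -d) || exit 1
status="unset"

# || : create the directory only if the test fails (it is missing).
[[ -d "$workdir/temp" ]] || mkdir "$workdir/temp"

# && : the assignment runs only because the test now succeeds.
[[ -d "$workdir/temp" ]] && status="temp exists"

# && : the test fails here, so the assignment is skipped.
[[ -d "$workdir/missing" ]] && status="should not happen"

echo "$status"
rm -r "$workdir"
```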
We started this chapter with a question. How could we make our sys_info_page
script detect whether the user had permission to read all the home
directories? With our knowledge of if, we can solve the problem by adding this
code to the report_home_space function:


report_home_space () {
    if [[ "$(id -u)" -eq 0 ]]; then
        cat <<- _EOF_
<h2>Home Space Utilization (All Users)</h2>
<pre>$(du -sh /home/*)</pre>
_EOF_
    else
        cat <<- _EOF_
<h2>Home Space Utilization ($USER)</h2>
<pre>$(du -sh $HOME)</pre>
_EOF_
    fi
    return
}


We evaluate the output of the id command. With the -u option, id outputs 
the numeric user ID number of the effective user. The superuser is always ID 
zero, and every other user is a number greater than zero. Knowing this, we 
can construct two different here documents, one taking advantage of super- 
user privileges and the other restricted to the user’s own home directory. 

We are going to take a break from the sys_info_page program, but don’t 
worry. It will be back. In the meantime, we’ll cover some topics that we’ll 
need when we resume our work. 


READING KEYBOARD INPUT 


The scripts we have written so far lack a feature common in most computer
programs: interactivity, or the capability of the program to interact with the
user. While many programs don’t need to be interactive, some programs benefit
from being able to accept input directly from the user. Take, for example, this
script from the previous chapter:


#!/bin/bash
# test-integer2: evaluate the value of an integer.
INT=-5

if [[ "$INT" =~ ^-?[0-9]+$ ]]; then
    if [ "$INT" -eq 0 ]; then
        echo "INT is zero."
    else
        if [ "$INT" -lt 0 ]; then
            echo "INT is negative."
        else
            echo "INT is positive."
        fi
        if [ $((INT % 2)) -eq 0 ]; then
            echo "INT is even."
        else
            echo "INT is odd."
        fi
    fi
else
    echo "INT is not an integer." >&2
    exit 1
fi


Each time we want to change the value of INT, we have to edit the script. 
It would be much more useful if the script could ask the user for a value. 
In this chapter, we will begin to look at how we can add interactivity to our 
programs. 


read—Read Values from Standard Input 
The read builtin command is used to read a single line of standard input.
This command can be used to read keyboard input or, when redirection is 
employed, a line of data from a file. The command has the following syntax: 


read [-options] [variable...] 


where options is one or more of the available options listed later in Table 28-1 
and variable is the name of one or more variables used to hold the input 
value. If no variable name is supplied, the shell variable REPLY contains the 
line of data. 

Basically, read assigns fields from standard input to the specified vari- 
ables. If we modify our integer evaluation script to use read, it might look 
like this: 


#!/bin/bash
# read-integer: evaluate the value of an integer.
echo -n "Please enter an integer -> "
read int

if [[ "$int" =~ ^-?[0-9]+$ ]]; then
    if [ "$int" -eq 0 ]; then
        echo "$int is zero."
    else
        if [ "$int" -lt 0 ]; then
            echo "$int is negative."
        else
            echo "$int is positive."
        fi
        if [ $((int % 2)) -eq 0 ]; then
            echo "$int is even."
        else
            echo "$int is odd."
        fi
    fi
else
    echo "Input value is not an integer." >&2
    exit 1
fi


We use echo with the -n option (which suppresses the trailing newline 
on output) to display a prompt, and then we use read to input a value for the 
variable int. Running this script results in this: 


[me@linuxbox ~]$ read-integer
Please enter an integer -> 5
5 is positive.
5 is odd.


read can assign input to multiple variables, as shown in this script: 


#!/bin/bash
# read-multiple: read multiple values from keyboard
echo -n "Enter one or more values > "
read var1 var2 var3 var4 var5

echo "var1 = '$var1'"
echo "var2 = '$var2'"
echo "var3 = '$var3'"
echo "var4 = '$var4'"
echo "var5 = '$var5'"


In this script, we assign and display up to five values. Notice how read 
behaves when given different numbers of values, shown here: 


[me@linuxbox ~]$ read-multiple
Enter one or more values > a b c d e
var1 = 'a'
var2 = 'b'
var3 = 'c'
var4 = 'd'
var5 = 'e'
[me@linuxbox ~]$ read-multiple
Enter one or more values > a
var1 = 'a'
var2 = ''
var3 = ''
var4 = ''
var5 = ''
[me@linuxbox ~]$ read-multiple
Enter one or more values > a b c d e f g
var1 = 'a'
var2 = 'b'
var3 = 'c'
var4 = 'd'
var5 = 'e f g'


If read receives fewer than the expected number, the extra variables are 
empty, while an excessive amount of input results in the final variable con- 
taining all of the extra input. 

If no variables are listed after the read command, a shell variable, REPLY, 
will be assigned all the input. 


#!/bin/bash
# read-single: read multiple values into default variable

echo -n "Enter one or more values > "
read

echo "REPLY = '$REPLY'"


Running this script results in this: 


[me@linuxbox ~]$ read-single 
Enter one or more values > a bc d 
REPLY = 'a bc d' 


Options 
read supports the options described in Table 28-1. 


Table 28-1: read Options 
Option Description 


-a array Assign the input to array, starting with index zero. We will cover 
arrays in Chapter 35. 


-d delimiter The first character in the string delimiter is used to indicate the end of 
input, rather than a newline character. 


-e Use Readline to handle input. This permits input editing in the same 
manner as the command line. 

-i string Use string as a default reply if the user simply presses ENTER. Requires 
the -e option. 

-n num Read num characters of input, rather than an entire line. 

-p prompt Display a prompt for input using the string prompt. 




-r Raw mode. Do not interpret backslash characters as escapes.


-s Silent mode. Do not echo characters to the display as they are 
typed. This is useful when inputting passwords and other confiden- 
tial information. 


-t seconds Timeout. Terminate input after seconds. read returns a non-zero exit 
status if an input times out. 


-u fd Use input from file descriptor fd, rather than standard input. 


Using the various options, we can do interesting things with read. For 
example, with the -p option, we can provide a prompt string. 


#!/bin/bash
# read-single: read multiple values into default variable

read -p "Enter one or more values > "

echo "REPLY = '$REPLY'"


With the -t and -s options, we can write a script that reads “secret” 
input and times out if the input is not completed in a specified time. 


#!/bin/bash 


# read-secret: input a secret passphrase 
if read -t 10 -sp "Enter secret passphrase > " secret_pass; then 
echo -e "\nSecret passphrase = '$secret_pass'" 
else 
echo -e "\nInput timed out" >&2 
exit 1 
fi 


The script prompts the user for a secret passphrase and waits ten sec- 
onds for input. If the entry is not completed within the specified time, the 
script exits with an error. Because the -s option is included, the characters 
of the passphrase are not echoed to the display as they are typed. 

It’s possible to supply the user with a default response using the -e and 
-i options together. 


#!/bin/bash 
# read-default: supply a default value if user presses Enter key. 


read -e -p "What is your user name? " -i $USER 
echo "You answered: '$REPLY'" 
In this script, we prompt the user to enter a username and use the
environment variable USER to provide a default value. When the script is run,
it displays the default string, and if the user simply presses ENTER, read will
assign the default string to the REPLY variable.


[me@linuxbox ~]$ read-default 
What is your user name? me 
You answered: 'me' 
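
Two of the options in Table 28-1, -r and -n, are not demonstrated above, so here is a minimal sketch of each. To keep it non-interactive, the input is supplied with a here string (the <<< operator, described later in this chapter). The script name, the variable names, and the sample strings are all assumptions made for this example.

```shell
#!/bin/bash
# read-raw: compare read with and without -r, and try -n
# The sample input contains a backslash on purpose.
input='C:\temp'

# Without -r, read treats the backslash as an escape and removes it.
read cooked <<< "$input"

# With -r, the backslash is kept as an ordinary character.
read -r raw <<< "$input"

# With -n 3, read stops after three characters instead of a whole line.
read -n 3 first_three <<< "abcdef"

echo "without -r: '$cooked'"
echo "with -r:    '$raw'"
echo "with -n 3:  '$first_three'"
```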


IFS 


Normally, the shell performs word-splitting on the input provided to read. 
As we have seen, this means that multiple words separated by one or more 
spaces become separate items on the input line and are assigned to sepa- 
rate variables by read. This behavior is configured by a shell variable named 
IFS (for Internal Field Separator). The default value of IFS contains a space, 
a tab, and a newline character, each of which will separate items from one 
another. 

We can adjust the value of IFS to control the separation of fields input to 
read. For example, the /etc/passwd file contains lines of data that use the colon 
character as a field separator. By changing the value of IFS to a single colon, 
we can use read to input the contents of /etc/passwd and successfully separate 
fields into different variables. Here we have a script that does just that: 


#!/bin/bash
# read-ifs: read fields from a file
FILE=/etc/passwd

read -p "Enter a username > " user_name
file_info="$(grep "^$user_name:" $FILE)"

if [ -n "$file_info" ]; then
    IFS=":" read user pw uid gid name home shell <<< "$file_info"
    echo "User = '$user'"
    echo "UID = '$uid'"
    echo "GID = '$gid'"
    echo "Full Name = '$name'"
    echo "Home Dir. = '$home'"
    echo "Shell = '$shell'"
else
    echo "No such user '$user_name'" >&2
    exit 1
fi


This script prompts the user to enter the username of an account on 
the system and then displays the different fields found in the user’s record 
in the /etc/passwd file. The script contains two interesting lines. The first is 
as follows: 


file_info=$(grep "^$user_name:" $FILE)


This line assigns the results of a grep command to the variable file_info.
The regular expression used by grep anchors the username at the start of the
line and requires a colon after it, which ensures that the username will match
only a single line in the /etc/passwd file.

The second interesting line is this one: 


IFS=":" read user pw uid gid name home shell <<< "$file_info" 


The line consists of three parts: a variable assignment, a read command 
with a list of variable names as arguments, and a strange new redirection 
operator. We’ll look at the variable assignment first. 

The shell allows one or more variable assignments to take place imme- 
diately before a command. These assignments alter the environment for the 
command that follows. The effect of the assignment is temporary, chang- 
ing the environment only for the duration of the command. In our case, 
the value of IFS is changed to a colon character. Alternately, we could have 
coded it this way: 


OLD_IFS="$IFS"
IFS=":"
read user pw uid gid name home shell <<< "$file_info"
IFS="$OLD_IFS"


where we store the value of IFS, assign a new value, perform the read
command, and then restore IFS to its original value. Clearly, placing the
variable assignment in front of the command is a more concise way of doing
the same thing.
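
A quick way to convince ourselves that the assignment really is temporary is to compare IFS before and after the read command. In this sketch the record is made-up sample data in /etc/passwd format, and the names record and saved_ifs are assumptions made for the example.

```shell
#!/bin/bash
# ifs-scope: show that IFS=":" placed before read is temporary
# The record below is made-up sample data, not a real account.
record="me:x:1000:1000:Example User:/home/me:/bin/bash"

saved_ifs="$IFS"
IFS=":" read user pw uid gid name home shell <<< "$record"

echo "user = '$user', name = '$name', shell = '$shell'"

# IFS in the running shell still holds its original value.
if [[ "$IFS" == "$saved_ifs" ]]; then
    echo "IFS is unchanged after read."
fi
```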

The <<< operator indicates a here string. A here string is like a here doc- 
ument, only shorter, consisting of a single string. In our example, the line 
of data from the /etc/passwd file is fed to the standard input of the read 
command. We might wonder why this rather oblique method was chosen 
rather than this: 


echo "$file_info" | IFS=":" read user pw uid gid name home shell 


Well, there’s a reason...




YOU CAN’T PIPE READ 


While the read command normally takes input from standard input, you cannot
do this:


echo "foo" | read 


We would expect this to work, but it does not. The command will appear 
to succeed, but the REPLY variable will always be empty. Why is this? 
The explanation has to do with the way the shell handles pipelines. In bash
(and other shells such as sh), pipelines create subshells. These are copies of
the shell and its environment that are used to execute the command in the
pipeline. In the previous example, read is executed in a subshell.

Subshells in Unix-like systems create copies of the environment for the
processes to use while they execute. When the process finishes, the copy of the
environment is destroyed. This means that a subshell can never alter the
environment of its parent process. read assigns variables, which then become
part of the environment. In the previous example, read assigns the value foo to
the variable REPLY in its subshell’s environment, but when the command exits,
the subshell and its environment are destroyed, and the effect of the
assignment is lost.

Using here strings is one way to work around this behavior. Another 
method is discussed in Chapter 36. 
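
The difference is easy to observe in a short script. This is a minimal sketch, assuming bash with its default settings; the script name pipe-read and the variables piped and here_string are made up for the example.

```shell
#!/bin/bash
# pipe-read: compare a pipeline with a here string
REPLY=""

# read runs in a subshell here, so the parent's REPLY is never changed.
echo "foo" | read
piped="$REPLY"

# read runs in the current shell here, so REPLY keeps the value.
read <<< "foo"
here_string="$REPLY"

echo "after pipeline:    '$piped'"
echo "after here string: '$here_string'"
```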


Validating Input 




With our new ability to have keyboard input comes an additional pro- 
gramming challenge: validating input. Often the difference between a 
well-written program and a poorly written one lies in the program’s ability 
to deal with the unexpected. Frequently, the unexpected appears in the 
form of bad input. We’ve done a little of this with our evaluation pro- 
grams in the previous chapter, where we checked the values of integers 
and screened out empty values and non-numeric characters. It is impor- 
tant to perform these kinds of programming checks every time a program 
receives input to guard against invalid data. This is especially important 
for programs that are shared by multiple users. Omitting these safeguards 
in the interests of economy might be excused if a program is to be used 
once and only by the author to perform some special task. Even then, if 
the program performs dangerous tasks such as deleting files, it would be 
wise to include data validation, just in case. 


Here we have an example program that validates various kinds of input: 


#!/bin/bash
# read-validate: validate input

invalid_input () {
    echo "Invalid input '$REPLY'" >&2
    exit 1
}

read -p "Enter a single item > "

# input is empty (invalid)
[[ -z "$REPLY" ]] && invalid_input

# input is multiple items (invalid)
(( "$(echo "$REPLY" | wc -w)" > 1 )) && invalid_input

# is input a valid filename?
if [[ "$REPLY" =~ ^[-[:alnum:]\._]+$ ]]; then
    echo "'$REPLY' is a valid filename."
    if [[ -e "$REPLY" ]]; then
        echo "And file '$REPLY' exists."
    else
        echo "However, file '$REPLY' does not exist."
    fi

    # is input a floating point number?
    if [[ "$REPLY" =~ ^-?[[:digit:]]*\.[[:digit:]]+$ ]]; then
        echo "'$REPLY' is a floating point number."
    else
        echo "'$REPLY' is not a floating point number."
    fi

    # is input an integer?
    if [[ "$REPLY" =~ ^-?[[:digit:]]+$ ]]; then
        echo "'$REPLY' is an integer."
    else
        echo "'$REPLY' is not an integer."
    fi
else
    echo "The string '$REPLY' is not a valid filename."
fi


This script prompts the user to enter an item. The item is subsequently
analyzed to determine its contents. As we can see, the script makes use of
many of the concepts that we have covered thus far, including shell
functions, [[ ]], (( )), the control operator &&, and if, as well as a healthy
dose of regular expressions.
A common type of interactivity is called menu-driven. In menu-driven
programs, the user is presented with a list of choices and is asked to choose
one. For example, we could imagine a program that presented the following:


Please Select:

1. Display System Information
2. Display Disk Space
3. Display Home Space Utilization
0. Quit

Enter selection [0-3] >


Using what we learned from writing our sys_info_page program, we 
can construct a menu-driven program to perform the tasks on the previ- 
ous menu. 


#!/bin/bash
# read-menu: a menu driven system information program
clear
echo "
Please Select:

1. Display System Information
2. Display Disk Space
3. Display Home Space Utilization
0. Quit
"
read -p "Enter selection [0-3] > "

if [[ "$REPLY" =~ ^[0-3]$ ]]; then
    if [[ "$REPLY" == 0 ]]; then
        echo "Program terminated."
        exit
    fi
    if [[ "$REPLY" == 1 ]]; then
        echo "Hostname: $HOSTNAME"
        uptime
        exit
    fi
    if [[ "$REPLY" == 2 ]]; then
        df -h
        exit
    fi
    if [[ "$REPLY" == 3 ]]; then
        if [[ "$(id -u)" -eq 0 ]]; then
            echo "Home Space Utilization (All Users)"
            du -sh /home/*
        else
            echo "Home Space Utilization ($USER)"
            du -sh "$HOME"
        fi
        exit
    fi
else
    echo "Invalid entry." >&2
    exit 1
fi


This script is logically divided into two parts. The first part displays the 
menu and inputs the response from the user. The second part identifies 
the response and carries out the selected action. Notice the use of the exit 
command in this script. It is used here to prevent the script from executing 
unnecessary code after an action has been carried out. The presence of 
multiple exit points in a program is generally a bad idea (it makes program 
logic harder to understand), but it works in this script. 


Summing Up 


In this chapter, we took our first steps toward interactivity, allowing users to 
input data into our programs via the keyboard. Using the techniques pre- 
sented thus far, it is possible to write many useful programs, such as special- 
ized calculation programs and easy-to-use front ends for arcane command 
line tools. In the next chapter, we will build on the menu-driven program 
concept to make it even better. 


Extra Credit 


It is important to study the programs in this chapter carefully and have 
a complete understanding of the way they are logically structured, as the 
programs to come will be increasingly complex. As an exercise, rewrite 
the programs in this chapter using the test command rather than the
[[ ]] compound command. Hint: Use grep to evaluate the regular
expressions and evaluate the exit status. This will be good practice.
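As one possible starting point for this exercise, here is a minimal sketch of validating a menu selection without [[ ]]. The function name valid_selection is an assumption made for illustration. It hands the value to grep and lets if act on grep's exit status, while the single-bracket test handles the plain string comparison.

```bash
# valid_selection (hypothetical helper): succeed only if $1 is one digit, 0-3.
valid_selection() {
    # grep -q prints nothing; its exit status is zero only on a match.
    printf '%s\n' "$1" | grep -Eq '^[0-3]$'
}

reply=2     # stands in for the value that read would place in REPLY
if valid_selection "$reply"; then
    if [ "$reply" = 0 ]; then
        echo "Program terminated."
    else
        echo "Selection $reply accepted."
    fi
else
    echo "Invalid entry." >&2
fi
```

Because grep works line by line, this sketch assumes the value is a single line, which is what read normally delivers.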
FLOW CONTROL: LOOPING WITH WHILE/UNTIL


In the previous chapter, we developed a menu-driven program to produce
various kinds of system information. The program works, but it still has a
significant usability problem. It executes only a single choice and then
terminates. Even worse, if an invalid selection is made, the program
terminates with an error, without giving the user an opportunity to try
again. It would be better if we could somehow construct the program so
that it could repeat the menu display and selection over and over,
until the user chooses to exit the program.

In this chapter, we will look at a programming concept called looping, 
which can be used to make portions of programs repeat. The shell provides 
three compound commands for looping. We will look at two of them in this 
chapter and the third in a later chapter. 
Daily life is full of repeated activities. Going to work each day, walking the
dog, and slicing a carrot are all tasks that involve repeating a series of steps.
Let’s consider slicing a carrot. If we express this activity in pseudocode, it
might look something like this:


1. Get cutting board.
2. Get knife.
3. Place carrot on cutting board.
4. Lift knife.
5. Advance carrot.
6. Slice carrot.
7. If entire carrot sliced, then quit; else go to step 4.


Steps 4 through 7 form a loop. The actions within the loop are repeated
until the condition, “entire carrot sliced,” is reached.


while 


bash can express a similar idea. Let’s say we wanted to display five numbers in
sequential order from 1 to 5. A bash script could be constructed as follows:


#!/bin/bash

# while-count: display a series of numbers

count=1

while [[ "$count" -le 5 ]]; do
    echo "$count"
    count=$((count + 1))
done
echo "Finished."


When executed, this script displays the following: 


[me@linuxbox ~]$ while-count
1
2
3
4
5
Finished.


The syntax of the while command is as follows: 


while commands; do commands; done 


Like if, while evaluates the exit status of a list of commands. As long as
the exit status is zero, it performs the commands inside the loop. In the
previous script, the variable count is created and assigned an initial value
of 1. The while command evaluates the exit status of the [[ ]] compound
command. As long as the [[ ]] command returns an exit status of zero, the
commands within the loop are executed. At the end of each cycle, the [[ ]]
command is repeated. After five iterations of the loop, the value of count
has increased to 6, the [[ ]] command no longer returns an exit status of
zero, and the loop terminates. The program continues with the next
statement following the loop.
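The same pattern can do more than print the counter. The following sketch, which assumes a made-up function name sum_to, uses a while loop to add the integers from 1 up to a limit. The loop body runs for as long as the [[ ]] test returns an exit status of zero.

```bash
# sum_to (hypothetical helper): print the sum of the integers 1 through $1.
sum_to() {
    local limit=$1 count=1 total=0
    while [[ "$count" -le "$limit" ]]; do
        total=$((total + count))    # accumulate the running total
        count=$((count + 1))        # move toward the exit condition
    done
    echo "$total"
}

sum_to 5
```

If the limit is 0, the test fails on the first evaluation and the loop body is never executed.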

We can use a while loop to improve the read-menu program from the pre- 
vious chapter. 


#!/bin/bash

# while-menu: a menu driven system information program

DELAY=3 # Number of seconds to display results

while [[ "$REPLY" != 0 ]]; do
    clear
    # The here document is flush left because <<- strips leading tabs only.
    cat <<- _EOF_
Please Select:

1. Display System Information
2. Display Disk Space
3. Display Home Space Utilization
0. Quit

_EOF_
    read -p "Enter selection [0-3] > "

    if [[ "$REPLY" =~ ^[0-3]$ ]]; then
        if [[ "$REPLY" == 1 ]]; then
            echo "Hostname: $HOSTNAME"
            uptime
            sleep "$DELAY"
        fi
        if [[ "$REPLY" == 2 ]]; then
            df -h
            sleep "$DELAY"
        fi
        if [[ "$REPLY" == 3 ]]; then
            if [[ "$(id -u)" -eq 0 ]]; then
                echo "Home Space Utilization (All Users)"
                du -sh /home/*
            else
                echo "Home Space Utilization ($USER)"
                du -sh "$HOME"
            fi
            sleep "$DELAY"
        fi
    else
        echo "Invalid entry."
        sleep "$DELAY"
    fi
done
echo "Program terminated."


By enclosing the menu in a while loop, we are able to have the program 
repeat the menu display after each selection. The loop continues as long as 
REPLY is not equal to 0 and the menu is displayed again, giving the user the 
opportunity to make another selection. At the end of each action, a sleep 
command is executed, so the program will pause for a few seconds to allow 
the results of the selection to be seen before the screen is cleared and the 
menu is redisplayed. Once REPLY is equal to 0, indicating the “quit” selection, 
the loop terminates, and execution continues with the line following done. 


Breaking Out of a Loop 


Chapter 29 


bash provides two builtin commands that can be used to control program 
flow inside loops. The break command immediately terminates a loop, and 
program control resumes with the next statement following the loop. The 
continue command causes the remainder of the loop to be skipped, and pro- 
gram control resumes with the next iteration of the loop. Here we see a ver- 
sion of the while-menu program incorporating both break and continue: 


#!/bin/bash

# while-menu2: a menu driven system information program

DELAY=3 # Number of seconds to display results

while true; do
    clear
    # The here document is flush left because <<- strips leading tabs only.
    cat <<- _EOF_
Please Select:

1. Display System Information
2. Display Disk Space
3. Display Home Space Utilization
0. Quit

_EOF_
    read -p "Enter selection [0-3] > "

    if [[ "$REPLY" =~ ^[0-3]$ ]]; then
        if [[ "$REPLY" == 1 ]]; then
            echo "Hostname: $HOSTNAME"
            uptime
            sleep "$DELAY"
            continue
        fi
        if [[ "$REPLY" == 2 ]]; then
            df -h
            sleep "$DELAY"
            continue
        fi
        if [[ "$REPLY" == 3 ]]; then
            if [[ "$(id -u)" -eq 0 ]]; then
                echo "Home Space Utilization (All Users)"
                du -sh /home/*
            else
                echo "Home Space Utilization ($USER)"
                du -sh "$HOME"
            fi
            sleep "$DELAY"
            continue
        fi
        if [[ "$REPLY" == 0 ]]; then
            break
        fi
    else
        echo "Invalid entry."
        sleep "$DELAY"
    fi
done
echo "Program terminated."


In this version of the script, we set up an endless loop (one that never 
terminates on its own) by using the true command to supply an exit status 
to while. Since true will always exit with an exit status of zero, the loop 
will never end. This is a surprisingly common scripting technique. Since 
the loop will never end on its own, it’s up to the programmer to provide 
some way to break out of the loop when the time is right. In this script, 
the break command is used to exit the loop when the 0 selection is chosen. 
The continue command has been included at the end of the other script 
choices to allow for more efficient execution. By using continue, the script 
will skip over code that is not needed when a selection is identified. For 
example, if the 1 selection is chosen and identified, there is no reason to 
test for the other selections. 
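To watch break and continue at work without typing menu selections, a smaller sketch may help. The function name print_evens is an assumption for illustration. It runs an endless loop, uses continue to skip odd numbers, and uses break to leave the loop once the limit has been passed.

```bash
# print_evens (hypothetical helper): print the even numbers from 1 through $1.
print_evens() {
    local limit=$1 n=0
    while true; do
        n=$((n + 1))
        if [[ "$n" -gt "$limit" ]]; then
            break       # leave the endless loop
        fi
        if [[ $((n % 2)) -ne 0 ]]; then
            continue    # odd number: skip the echo, start the next iteration
        fi
        echo "$n"
    done
}

print_evens 6
```

Note that the counter is incremented before either test. If continue were reached before the increment, the loop would never advance and really would be endless.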


until 


The until compound command is much like while, except instead of exit- 
ing a loop when a non-zero exit status is encountered, it does the opposite. 
An until loop continues until it receives a zero exit status. In our while-count 
script, we continued the loop as long as the value of the count variable was 
less than or equal to 5. We could get the same result by coding the script 
with until. 


#!/bin/bash

# until-count: display a series of numbers

count=1

until [[ "$count" -gt 5 ]]; do
    echo "$count"
    count=$((count + 1))
done
echo "Finished."


By changing the test expression to $count -gt 5, until will terminate 
the loop at the correct time. The decision of whether to use the while or 
until loop is usually a matter of choosing the one that allows the clearest 
test to be written. 
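As an illustration of a case where until reads naturally, consider a countdown, where the loop should run until the counter reaches zero. This is only a sketch, and the function name countdown is made up for the example.

```bash
# countdown (hypothetical helper): count down from $1 to 1.
countdown() {
    local n=$1
    until [[ "$n" -le 0 ]]; do  # keep looping until the test succeeds
        echo "$n"
        n=$((n - 1))
    done
    echo "Liftoff."
}

countdown 3
```

The equivalent while loop would test [[ "$n" -gt 0 ]] instead. Both behave the same way, so the choice comes down to which condition is easier to read.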


Reading Files with Loops 


while and until can process standard input. This allows files to be processed
with while and until loops. In the following example, we will display the
contents of the distros.txt file used in earlier chapters:


#!/bin/bash

# while-read: read lines from a file

while read distro version release; do
    printf "Distro: %s\tVersion: %s\tReleased: %s\n" \
        "$distro" \
        "$version" \
        "$release"
done < distros.txt


To redirect a file to the loop, we place the redirection operator after 
the done statement. The loop will use read to input the fields from the redi- 
rected file. The read command will exit after each line is read, with a zero 
exit status until the end-of-file is reached. At that point, it will exit with a 
non-zero exit status, thereby terminating the loop. It is also possible to pipe 
standard input into a loop. 


#!/bin/bash

# while-read2: read lines from a file

sort -k 1,1 -k 2n distros.txt | while read distro version release; do
    printf "Distro: %s\tVersion: %s\tReleased: %s\n" \
        "$distro" \
        "$version" \
        "$release"
done


Here we take the output of the sort command and display the stream of
text. However, it is important to remember that since a pipe will execute the
loop in a subshell, any variables created or assigned within the loop will be
lost when the loop terminates.
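The effect of the subshell is easy to miss, so here is a short sketch that makes it visible. Both function names and the sample text are assumptions made for illustration. The first function pipes its input into the loop, and the second feeds the same input to the loop with a here string, which keeps the loop in the current shell.

```bash
# count_with_pipe (hypothetical): count lines using a loop at the end of a pipe.
count_with_pipe() {
    local count=0 line
    printf '%s\n' "$1" | while read -r line; do
        count=$((count + 1))    # changes a copy inside the subshell
    done
    echo "$count"
}

# count_with_redirect (hypothetical): same loop, input supplied by redirection.
count_with_redirect() {
    local count=0 line
    while read -r line; do
        count=$((count + 1))    # changes the function's own variable
    done <<< "$1"
    echo "$count"
}

sample="alpha
beta
gamma"
echo "pipe:     $(count_with_pipe "$sample")"
echo "redirect: $(count_with_redirect "$sample")"
```

Under bash's default settings, the piped version reports 0 because the increments happen in the subshell, while the redirected version reports 3.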


Summing Up 


With the introduction of loops and our previous encounters with branching,
subroutines, and sequences, we have covered the major types of flow
control used in programs. bash has some more tricks up its sleeve, but they
are refinements on these basic concepts.
TROUBLESHOOTING


Now that our scripts are becoming more complex, it’s time to look at what
happens when things go wrong. In this chapter, we’ll look at some of the
common kinds of errors that occur in scripts and examine a few useful
techniques that can be used to track down and eradicate problems.


Syntactic Errors 


One general class of errors is syntactic. Syntactic errors involve mistyping 
some element of shell syntax. The shell will stop executing a script when it 
encounters this type of error. 
In the following discussions, we will use this script to demonstrate
common types of errors. 


#!/bin/bash

# trouble: script to demonstrate common errors

number=1

if [ $number = 1 ]; then
    echo "Number is equal to 1."
else
    echo "Number is not equal to 1."
fi


As written, this script runs successfully. 


[me@linuxbox ~]$ trouble 
Number is equal to 1. 


Missing Quotes 


Let’s edit our script and remove the trailing quote from the argument fol- 
lowing the first echo command. 


#!/bin/bash

# trouble: script to demonstrate common errors

number=1

if [ $number = 1 ]; then
    echo "Number is equal to 1.
else
    echo "Number is not equal to 1."
fi


Here is what happens: 


[me@linuxbox ~]$ trouble 
/home/me/bin/trouble: line 10: unexpected EOF while looking for matching `"'
/home/me/bin/trouble: line 13: syntax error: unexpected end of file 


It generates two errors. Interestingly, the line numbers reported by
the error messages are not where the missing quote was removed but
rather much later in the program. If we follow the program after the
missing quote, we can see why. bash will continue looking for the closing
quote until it finds one, which it does, immediately after the second echo
command. After that, bash becomes very confused. The syntax of the
subsequent if command is broken because the fi statement is now inside
a quoted (but open) string.

In long scripts, this kind of error can be quite hard to find. Using an
editor with syntax highlighting will help since, in most cases, it will display
quoted strings in a distinctive manner from other kinds of shell syntax. If a
complete version of vim is installed, syntax highlighting can be enabled by
entering this command:


:syntax on 


Missing or Unexpected Tokens 


Another common mistake is forgetting to complete a compound command, 
such as if or while. Let’s look at what happens if we remove the semicolon 
after test in the if command: 


#!/bin/bash

# trouble: script to demonstrate common errors

number=1

if [ $number = 1 ] then
    echo "Number is equal to 1."
else
    echo "Number is not equal to 1."
fi


The result is this: 


[me@linuxbox ~]$ trouble 
/home/me/bin/trouble: line 9: syntax error near unexpected token `else'
/home/me/bin/trouble: line 9: `else'


Again, the error message points to an error that occurs later than the 
actual problem. What happens is really pretty interesting. As we recall, 
if accepts a list of commands and evaluates the exit code of the last com- 
mand in the list. In our program, we intend this list to consist of a single 
command, [, a synonym for test. The [ command takes what follows it as 
a list of arguments; in our case, that’s four arguments: $number, =, 1, and ].
With the semicolon removed, the word then is added to the list of argu- 
ments, which is syntactically legal. The following echo command is legal, 
too. It’s interpreted as another command in the list of commands that if 
will evaluate for an exit code. The else is encountered next, but it’s out of 
place since the shell recognizes it as a reserved word (a word that has special 
meaning to the shell) and not the name of a command, which is the reason 
for the error message. 
Unanticipated Expansions


It’s possible to have errors that occur only intermittently in a script. 
Sometimes the script will run fine, and other times it will fail because 
of the results of an expansion. If we return our missing semicolon and 
change the value of number to an empty variable, we can demonstrate. 


#!/bin/bash

# trouble: script to demonstrate common errors

number=

if [ $number = 1 ]; then
    echo "Number is equal to 1."
else
    echo "Number is not equal to 1."
fi


Running the script with this change results in the following output: 


[me@linuxbox ~]$ trouble 
/home/me/bin/trouble: line 7: [: =: unary operator expected 
Number is not equal to 1. 


We get this rather cryptic error message, followed by the output of the 
second echo command. The problem is the expansion of the number variable 
within the test command. When the following command: 


[ $number = 1 ] 


undergoes expansion with number being empty, the result is this: 


[  = 1 ]


which is invalid and the error is generated. The = operator is a binary oper- 
ator (it requires a value on each side), but the first value is missing, so the 
test command expects a unary operator (such as -z) instead. Further, since 
the test failed (because of the error), the if command receives a non-zero 
exit code and acts accordingly, and the second echo command is executed. 

This problem can be corrected by adding quotes around the first argu- 
ment in the test command. 


[ "$number" = 1 ]


Then when expansion occurs, the result will be this: 


[ "" = 1 ]


This yields the correct number of arguments. In addition to empty 
strings, quotes should be used in cases where a value could expand into 
multiword strings, as with filenames containing embedded spaces. 


Make it a rule to always enclose variables and command substitutions in double 
quotes unless word splitting is needed. 
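The effect of the quotes can be checked with a small sketch. The function name describe_number is an assumption for illustration. It wraps the corrected test and is called with an ordinary value, an empty value, and a value containing a space.

```bash
# describe_number (hypothetical helper): report whether $1 equals 1.
describe_number() {
    local number=$1
    # Quoted, "$number" always expands to exactly one argument.
    if [ "$number" = 1 ]; then
        echo "Number is equal to 1."
    else
        echo "Number is not equal to 1."
    fi
}

describe_number 1
describe_number ""      # empty value: still one (empty) argument
describe_number "1 2"   # embedded space: still one argument
```

With the quotes in place, none of the three calls should produce an error message from the [ command.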


Logical Errors 


Unlike syntactic errors, logical errors do not prevent a script from running. 
The script will run, but it will not produce the desired result because of 
a problem with its logic. There are countless numbers of possible logical 
errors, but here are a few of the most common kinds found in scripts: 


• Incorrect conditional expressions. It’s easy to incorrectly code
  an if/then/else expression and have the wrong logic carried out.
  Sometimes the logic will be reversed, or it will be incomplete.

• “Off by one” errors. When coding loops that employ counters, it is
  possible to overlook that the loop may require that the counting start
  with zero, rather than one, for the count to conclude at the correct
  point. These kinds of errors result in either a loop “going off the end”
  by counting too far or a loop missing the last iteration by terminating
  one iteration too soon.

• Unanticipated situations. Most logic errors result from a program
  encountering data or situations that were unforeseen by the
  programmer. As we have seen, this can also include unanticipated
  expansions, such as a filename that contains embedded spaces that
  expands into multiple command arguments rather than a single filename.
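The “off by one” error in the list above can be shown with a pair of loops. Both functions in this sketch are hypothetical and are meant to print the numbers from 1 through a limit. They differ only in the comparison operator.

```bash
# count_buggy (hypothetical): intended to print 1 through $1, but stops early.
count_buggy() {
    local limit=$1 count=1
    while [[ "$count" -lt "$limit" ]]; do   # off by one: -lt excludes the limit
        echo "$count"
        count=$((count + 1))
    done
}

# count_fixed (hypothetical): prints 1 through $1.
count_fixed() {
    local limit=$1 count=1
    while [[ "$count" -le "$limit" ]]; do   # -le includes the limit
        echo "$count"
        count=$((count + 1))
    done
}

echo "Buggy version:"
count_buggy 3
echo "Fixed version:"
count_fixed 3
```

Both versions run without any error message, which is what makes this kind of mistake a logical error rather than a syntactic one.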


Defensive Programming 


It is important to verify assumptions when programming. This means a 
careful evaluation of the exit status of programs and commands that are 
used by a script. Here is an example, based on a true story. An unfortunate 
system administrator wrote a script to perform a maintenance task on an 
important server. The script contained the following two lines of code: 


cd $dir_name 
rm *


There is nothing intrinsically wrong with these two lines, as long as the 
directory named in the variable, dir_name, exists. But what happens if it does 
not? In that case, the cd command fails, and the script continues to the next 
line and deletes the files in the current working directory. Not the desired 
outcome at all! The hapless administrator destroyed an important part of 
the server because of this design decision. 
Let’s look at some ways this design could be improved. First, it might
be wise to ensure that the dir_name variable expands into only one word by 
quoting it and make the execution of rm contingent on the success of cd. 


cd "$dir_name" && rm * 


This way, if the cd command fails, the rm command is not carried out. 
This is better but still leaves open the possibility that the variable, dir_name, 
is unset or empty, which would result in the files in the user’s home direc- 
tory being deleted. This could also be avoided by checking to see that 
dir_name actually contains the name of an existing directory. 


[[ -d "$dir_name" ]] && cd "$dir_name" && rm * 


Often, it is best to include logic to terminate the script and report an 
error when a situation such as the one shown previously occurs. 


# Delete files in directory $dir_name

if [[ ! -d "$dir_name" ]]; then
    echo "No such directory: '$dir_name'" >&2
    exit 1
fi
if ! cd "$dir_name"; then
    echo "Cannot cd to '$dir_name'" >&2
    exit 1
fi
if ! rm *; then
    echo "File deletion failed. Check results" >&2
    exit 1
fi


Here, we check both the name, to see that it is an existing directory, 
and the success of the cd command. If either fails, a descriptive error mes- 
sage is sent to standard error, and the script terminates with an exit status 
of one to indicate a failure. 


Watch Out for Filenames 


There is another problem with this file deletion script that is more obscure 
but could be very dangerous. Unix (and Unix-like operating systems) has, 
in the opinion of many, a serious design flaw when it comes to filenames. 
Unix is extremely permissive about them. In fact, there are only two char- 
acters that cannot be included in a filename. The first is the / character 
since it is used to separate elements of a pathname, and the second is the 
null character (a zero byte), which is used internally to mark the ends of 
strings. Everything else is legal including spaces, tabs, line feeds, leading 
hyphens, carriage returns, and so on. 


Of particular concern are leading hyphens. For example, it’s perfectly
legal to have a file named -rf ~. Consider for a moment what happens when
that filename is passed to rm.

To defend against this problem, we want to change our rm command in 
the file deletion script from this: 


rm * 


to the following: 


rm ./* 


This will prevent a filename starting with a hyphen from being inter- 
preted as a command option. As a general rule, always precede wildcards 
(such as * and ?) with ./ to prevent misinterpretation by commands. This 
includes things like *.pdf and ???.mp3, for example. 
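Putting the defensive checks and the ./ prefix together might look like the following sketch. The function name delete_dir_contents is an assumption for illustration, and the demonstration works only inside a scratch directory created with mktemp, which is assumed to be available.

```bash
# delete_dir_contents (hypothetical helper): remove the visible files in $1.
delete_dir_contents() {
    local dir_name=$1
    if [[ ! -d "$dir_name" ]]; then
        echo "No such directory: '$dir_name'" >&2
        return 1
    fi
    # The parentheses run cd and rm in a subshell, so the caller's
    # working directory is left unchanged. rm runs only if cd succeeds.
    (cd "$dir_name" && rm ./*)
}

# Demonstration in a throwaway directory.
scratch=$(mktemp -d)
touch "$scratch/-v" "$scratch/notes.txt"    # one name begins with a hyphen
delete_dir_contents "$scratch" && echo "Deleted."
ls -A "$scratch"                            # prints nothing if both are gone
rmdir "$scratch"
```

Because the wildcard is written as ./*, the file named -v reaches rm as ./-v and is treated as a filename rather than as an option.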


PORTABLE FILENAMES 


To ensure that a filename is portable between multiple platforms (i.e., different
types of computers and operating systems), care must be taken to limit which
characters are included in a filename. There is a standard called the POSIX
Portable Filename Character Set that can be used to maximize the chances that
a filename will work across different systems. The standard is pretty simple. The
only characters allowed are the uppercase letters A-Z, the lowercase letters
a-z, the numerals 0-9, period (.), hyphen (-), and underscore (_). The standard
further suggests that filenames not begin with a hyphen.


Verifying Input 

A general rule of good programming is that if a program accepts input,
it must be able to deal with anything it receives. This usually means that
input must be carefully screened to ensure that only valid input is accepted
for further processing. We saw an example of this in Chapter 28
when we studied the read command. One script contained the following
test to verify a menu selection:


[[ $REPLY =~ ^[0-3]$ ]]


This test is very specific. It will return a zero exit status only if the string 
entered by the user is a numeral in the range of zero to three. Nothing else 
will be accepted. Sometimes these kinds of tests can be challenging to write, 
but the effort is necessary to produce a high-quality script. 
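Tests like this are easier to reuse and to check when they are wrapped in small functions. The two function names in this sketch are assumptions for illustration. One repeats the menu test, and the other accepts only an optionally signed whole number.

```bash
# is_menu_choice (hypothetical helper): succeed only for a single digit 0-3.
is_menu_choice() {
    [[ "$1" =~ ^[0-3]$ ]]
}

# is_integer (hypothetical helper): succeed only for digits with an optional
# leading minus sign.
is_integer() {
    [[ "$1" =~ ^-?[0-9]+$ ]]
}

if is_menu_choice "3"; then
    echo "3 is a valid selection"
fi
if ! is_menu_choice "3 "; then
    echo "a trailing space is rejected"
fi
if ! is_integer "4.2"; then
    echo "4.2 is not an integer"
fi
```

The ^ and $ anchors matter here. Without them, any string that merely contains a matching character would be accepted.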
DESIGN IS A FUNCTION OF TIME


When I was a college student studying industrial design, a wise professor
stated that the amount of design on a project was determined by the amount
of time given to the designer. If you were given five minutes to design a device
“that kills flies,” you designed a flyswatter. If you were given five months, you
might come up with a laser-guided “antifly system” instead.

The same principle applies to programming. Sometimes a “quick-and-dirty”
script will do if it’s going to be used once and only by the programmer. That
kind of script is common and should be developed quickly to make the effort
economical. Such scripts don’t need a lot of comments and defensive checks. On
the other hand, if a script is intended for production use, that is, a script that will
be used over and over for an important task or by multiple users, it needs much
more careful development.


Testing is an important step in every kind of software development, includ- 
ing scripts. There is a saying in the open source world, “release early, release 
often,” that reflects this fact. By releasing early and often, software gets more 
exposure to use and testing. Experience has shown that bugs are much easier 
to find, and much less expensive to fix, if they are found early in the develop- 
ment cycle. 

In Chapter 26, we saw how stubs can be used to verify program flow. 
From the earliest stages of script development, they are a valuable tech- 
nique to check the progress of our work. 

Let’s look at the file-deletion problem shown previously and see how 
this could be coded for easy testing. Testing the original fragment of code 
would be dangerous since its purpose is to delete files, but we could modify 
the code to make the test safe. 


if [[ -d $dir_name ]]; then
    if cd $dir_name; then
        echo rm * # TESTING
    else
        echo "cannot cd to '$dir_name'" >&2
        exit 1
    fi
else
    echo "no such directory: '$dir_name'" >&2
    exit 1
fi
exit # TESTING


Because the error conditions already output useful messages, we don’t
have to add any. The most important change is placing an echo command just
before the rm command to allow the command and its expanded argument
list to be displayed, rather than the command actually being executed. This
change allows safe execution of the code. At the end of the code fragment,
we place an exit command to conclude the test and prevent any other part of
the script from being carried out. The need for this will vary according to the
design of the script.

We also include some comments that act as “markers” for our test- 
related changes. These can be used to help find and remove the changes 
when testing is complete. 


Test Cases 


To perform useful testing, it’s important to develop and apply good test 
cases. This is done by carefully choosing input data or operating conditions 
that reflect edge and corner cases. In our code fragment (which is simple), we 
want to know how the code performs under three specific conditions. 


• dir_name contains the name of an existing directory.
• dir_name contains the name of a nonexistent directory.
• dir_name is empty.


By performing the test with each of these conditions, good test coverage 
is achieved. 
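One way to run these three test cases is sketched below. The function name try_delete is an assumption for illustration. Its body is wrapped in parentheses rather than braces so that it runs in a subshell, which means the cd and exit commands from the fragment affect only the function and not the calling shell. The echo stands in for the real rm command.

```bash
# try_delete (hypothetical helper): the test-safe fragment as a function.
try_delete() (
    dir_name=$1
    if [[ -d "$dir_name" ]]; then
        if cd "$dir_name"; then
            echo "rm ./* (not executed)" # TESTING
        else
            echo "cannot cd to '$dir_name'" >&2
            exit 1
        fi
    else
        echo "no such directory: '$dir_name'" >&2
        exit 1
    fi
)

existing=$(mktemp -d)   # scratch directory; assumes mktemp is available

try_delete "$existing"
echo "existing directory: exit status $?"
try_delete "$existing/missing"
echo "nonexistent directory: exit status $?"
try_delete ""
echo "empty name: exit status $?"

rmdir "$existing"
```

Only the first case should report an exit status of zero. The other two should print an error message on standard error and report a non-zero exit status.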

Just as with design, testing is a function of time, as well. Not every script
feature needs to be extensively tested. It’s really a matter of determining what
is most important. Since it could be so destructive if it malfunctioned,
our code fragment deserves careful consideration during both its
design and testing.


Debugging 


If testing reveals a problem with a script, the next step is debugging. “A 
problem” usually means that the script is, in some way, not performing to 
the programmer’s expectations. If this is the case, we need to carefully 
determine exactly what the script is actually doing and why. Finding bugs 
can sometimes involve a lot of detective work. 

A well-designed script will try to help. It should be programmed defen- 
sively to detect abnormal conditions and provide useful feedback to the user. 
Sometimes, however, problems are quite strange and unexpected, and more 
involved techniques are required. 


Finding the Problem Area 


In some scripts, particularly long ones, it is sometimes useful to isolate the 
area of the script that is related to the problem. This won’t always be the 
actual error, but isolation will often provide insights into the actual cause. 
One technique that can be used to isolate code is “commenting out” sections 
of a script. For example, our file deletion fragment could be modified to 
determine whether the removed section was related to an error. 
if [[ -d $dir_name ]]; then
    if cd $dir_name; then
        rm *
    else
        echo "cannot cd to '$dir_name'" >&2
        exit 1
    fi
# else
#     echo "no such directory: '$dir_name'" >&2
#     exit 1
fi


By placing comment symbols at the beginning of each line in a logical 
section of a script, we prevent that section from being executed. Testing can 
then be performed again to see whether the removal of the code has any 
impact on the behavior of the bug. 


Tracing 


Bugs are often cases of unexpected logical flow within a script. That is, por- 
tions of the script either are never being executed or are being executed in 
the wrong order or at the wrong time. To view the actual flow of the pro- 
gram, we use a technique called tracing. 

One tracing method involves placing informative messages in a script that 
display the location of execution. We can add messages to our code fragment. 


echo "preparing to delete files" >&2
if [[ -d $dir_name ]]; then
    if cd $dir_name; then
echo "deleting files" >&2
        rm *
    else
        echo "cannot cd to '$dir_name'" >&2
        exit 1
    fi
else
    echo "no such directory: '$dir_name'" >&2
    exit 1
fi
echo "file deletion complete" >&2


We send the messages to standard error to separate them from normal
output. We also do not indent the lines containing the messages, so they are
easier to find when it’s time to remove them.

Now when the script is executed, it’s possible to see that the file deletion 
has been performed. 


[me@linuxbox ~]$ deletion-script
preparing to delete files
deleting files
file deletion complete
[me@linuxbox ~]$


bash also provides a method of tracing, implemented by the -x option and 
the set command with the -x option. Using our earlier trouble script, we can 
activate tracing for the entire script by adding the -x option to the first line. 


#!/bin/bash -x

# trouble: script to demonstrate common errors

number=1

if [ $number = 1 ]; then
    echo "Number is equal to 1."
else
    echo "Number is not equal to 1."
fi


When executed, the results look like this: 


[me@linuxbox ~]$ trouble
+ number=1
+ '[' 1 = 1 ']'
+ echo 'Number is equal to 1.'
Number is equal to 1.


With tracing enabled, we see the commands performed with expansions 
applied. The leading plus signs indicate the display of the trace to distinguish 
them from lines of regular output. The plus sign is the default character for 
trace output. It is contained in the PS4 (prompt string 4) shell variable. The 
contents of this variable can be adjusted to make the prompt more useful. 
Here, we modify the contents of the variable to include the current line 
number in the script where the trace is performed. Note that single quotes 
are required to prevent expansion until the prompt is actually used. 


[me@linuxbox ~]$ export PS4='$LINENO + '
[me@linuxbox ~]$ trouble
5 + number=1
7 + '[' 1 = 1 ']'
8 + echo 'Number is equal to 1.'
Number is equal to 1.


To perform a trace on a selected portion of a script, rather than the 
entire script, we can use the set command with the -x option. 


#!/bin/bash

# trouble: script to demonstrate common errors

number=1

set -x # Turn on tracing
if [ $number = 1 ]; then
    echo "Number is equal to 1."
else
    echo "Number is not equal to 1."
fi
set +x # Turn off tracing


We use the set command with the -x option to activate tracing and the +x 
option to deactivate tracing. This technique can be used to examine multiple 
portions of a troublesome script. 
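The same idea works inside a function. In the sketch below, the function name check_number is an assumption for illustration. Tracing is switched on only around the comparison, and the final line uses a common idiom that discards the trace of the set +x command itself so that it does not clutter the output.

```bash
# check_number (hypothetical helper): trace only the comparison.
check_number() {
    local number=$1
    set -x # Turn on tracing
    if [ "$number" = 1 ]; then
        echo "Number is equal to 1."
    else
        echo "Number is not equal to 1."
    fi
    { set +x; } 2>/dev/null # Turn off tracing; hide the trace of set +x
}

check_number 1
```

The trace lines are written to standard error, so they can be separated from the script's normal output with a redirection such as 2> trace.log, where the filename is only an example.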


Examining Values During Execution 


It is often useful, along with tracing, to display the content of variables to 
see the internal workings of a script while it is being executed. Applying 
additional echo statements will usually do the trick. 


#!/bin/bash

# trouble: script to demonstrate common errors

number=1

echo "number=$number" # DEBUG
set -x # Turn on tracing
if [ $number = 1 ]; then
    echo "Number is equal to 1."
else
    echo "Number is not equal to 1."
fi
set +x # Turn off tracing


In this trivial example, we simply display the value of the variable 
number and mark the added line with a comment to facilitate its later 
identification and removal. This technique is particularly useful when 
watching the behavior of loops and arithmetic within scripts. 


Summing Up 
In this chapter, we looked at just a few of the problems that can crop up
during script development. Of course, there are many more. The techniques
described here will help us find most common bugs. Debugging is a fine
art that is developed through experience, both in knowing how to avoid bugs
(testing constantly throughout development) and in finding bugs (effective
use of tracing).
FLOW CONTROL: BRANCHING WITH CASE


In this chapter, we will continue our look at flow control. In Chapter 28,
we constructed some simple menus and built the logic used to act on a
user’s selection. To do this, we used a series of if commands to identify
which of the possible choices had been selected. This type of logical
construct appears frequently in programs, so much so that many
programming languages (including the shell) provide a special flow
control mechanism for multiple-choice decisions.


The case Command 


In bash, the multiple-choice compound command is called case. It has the 
following syntax. 


case word in
    [pattern [| pattern]...) commands ;;]...
esac


If we look at the read-menu program from Chapter 28, we see the logic 
used to act on a user’s selection. 


#!/bin/bash

# read-menu: a menu driven system information program

clear
echo "
Please Select:

1. Display System Information
2. Display Disk Space
3. Display Home Space Utilization
0. Quit
"
read -p "Enter selection [0-3] > "

if [[ "$REPLY" =~ ^[0-3]$ ]]; then
    if [[ "$REPLY" == 0 ]]; then
        echo "Program terminated."
        exit
    fi
    if [[ "$REPLY" == 1 ]]; then
        echo "Hostname: $HOSTNAME"
        uptime
        exit
    fi
    if [[ "$REPLY" == 2 ]]; then
        df -h
        exit
    fi
    if [[ "$REPLY" == 3 ]]; then
        if [[ "$(id -u)" -eq 0 ]]; then
            echo "Home Space Utilization (All Users)"
            du -sh /home/*
        else
            echo "Home Space Utilization ($USER)"
            du -sh "$HOME"
        fi
        exit
    fi
else
    echo "Invalid entry." >&2
    exit 1
fi


Using case, we can replace this logic with something simpler. 


#!/bin/bash

# case-menu: a menu driven system information program

clear
echo "
Please Select:

1. Display System Information
2. Display Disk Space
3. Display Home Space Utilization
0. Quit
"
read -p "Enter selection [0-3] > "

case "$REPLY" in
    0)  echo "Program terminated."
        exit
        ;;
    1)  echo "Hostname: $HOSTNAME"
        uptime
        ;;
    2)  df -h
        ;;
    3)  if [[ "$(id -u)" -eq 0 ]]; then
            echo "Home Space Utilization (All Users)"
            du -sh /home/*
        else
            echo "Home Space Utilization ($USER)"
            du -sh "$HOME"
        fi
        ;;
    *)  echo "Invalid entry" >&2
        exit 1
        ;;
esac


The case command looks at the value of word, which in our example is 
the value of the REPLY variable, and then attempts to match it against one 
of the specified patterns. When a match is found, the commands associated 
with the specified pattern are executed. After a match is found, no further 
matches are attempted. 
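
The "first match wins" behavior can be seen in a small sketch. The function name describe and its wildcard patterns are invented for illustration; the word abuzz would satisfy both of the first two patterns, but only the first one is acted upon.

```shell
#!/bin/bash

# first-match: hypothetical example showing that case stops at the first match

describe () {
    case "$1" in
        a*)  echo "starts with a" ;;
        *z)  echo "ends with z" ;;
        *)   echo "something else" ;;
    esac
}

describe "abuzz"   # fits both a* and *z; only the a* action runs
describe "fizz"
describe "hello"
```

Because of this, the order of the patterns matters: more specific patterns generally belong before more general ones.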


Patterns 


The patterns used by case are the same as those used by pathname expansion.
Patterns are terminated with a ) character. Table 31-1 describes some
valid patterns.

Table 31-1: case Pattern Examples

Pattern         Description
a)              Matches if word equals a.
[[:alpha:]])    Matches if word is a single alphabetic character.
???)            Matches if word is exactly three characters long.
*.txt)          Matches if word ends with the characters .txt.
*)              Matches any value of word. It is good practice to include this
                as the last pattern in a case command to catch any values of
                word that did not match a previous pattern, that is, to catch
                any possible invalid values.


Here is an example of patterns at work: 


#!/bin/bash

read -p "enter word > "

case "$REPLY" in
    [[:alpha:]])    echo "is a single alphabetic character." ;;
    [ABC][0-9])     echo "is A, B, or C followed by a digit." ;;
    ???)            echo "is three characters long." ;;
    *.txt)          echo "is a word ending in '.txt'" ;;
    *)              echo "is something else." ;;
esac


It is also possible to combine multiple patterns using the vertical bar 
character as a separator. This creates an “or” conditional pattern. This is 
useful for such things as handling both uppercase and lowercase characters. 
Here’s an example: 


#!/bin/bash

# case-menu: a menu driven system information program

clear
echo "
Please Select:

A. Display System Information
B. Display Disk Space
C. Display Home Space Utilization
Q. Quit
"
read -p "Enter selection [A, B, C or Q] > "

case "$REPLY" in
    q|Q) echo "Program terminated."
         exit
         ;;
    a|A) echo "Hostname: $HOSTNAME"
         uptime
         ;;
    b|B) df -h
         ;;
    c|C) if [[ "$(id -u)" -eq 0 ]]; then
             echo "Home Space Utilization (All Users)"
             du -sh /home/*
         else
             echo "Home Space Utilization ($USER)"
             du -sh "$HOME"
         fi
         ;;
    *)   echo "Invalid entry" >&2
         exit 1
         ;;
esac


Here, we modify the case-menu program to use letters instead of digits 
for menu selection. Notice how the new patterns allow for entry of both 
uppercase and lowercase letters. 
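
The same technique is not limited to single letters. As a sketch, the hypothetical confirm function below (the name and the return codes are our own choices) accepts several spellings of yes and no by listing them as alternatives in one pattern.

```shell
#!/bin/bash

# confirm: hypothetical helper that accepts several spellings of yes and no

confirm () {
    case "$1" in
        y|Y|yes|YES)  return 0 ;;
        n|N|no|NO)    return 1 ;;
        *)            echo "Please answer yes or no." >&2
                      return 2 ;;
    esac
}

if confirm "YES"; then
    echo "confirmed"
fi
```

Returning distinct exit statuses lets the caller tell a "no" apart from an answer that was not understood.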


Performing Multiple Actions 


In versions of bash prior to 4.0, case allowed only one action to be per- 
formed on a successful match. After a successful match, the command 
would terminate. Here we see a script that tests a character: 


#!/bin/bash

# case4-1: test a character

read -n 1 -p "Type a character > "
echo
case "$REPLY" in
    [[:upper:]])    echo "'$REPLY' is upper case." ;;
    [[:lower:]])    echo "'$REPLY' is lower case." ;;
    [[:alpha:]])    echo "'$REPLY' is alphabetic." ;;
    [[:digit:]])    echo "'$REPLY' is a digit." ;;
    [[:graph:]])    echo "'$REPLY' is a visible character." ;;
    [[:punct:]])    echo "'$REPLY' is a punctuation symbol." ;;
    [[:space:]])    echo "'$REPLY' is a whitespace character." ;;
    [[:xdigit:]])   echo "'$REPLY' is a hexadecimal digit." ;;
esac


Running this script produces this: 


[me@linuxbox ~]$ case4-1
Type a character > a
'a' is lower case.


The script works for the most part but fails if a character matches more
than one of the POSIX character classes. For example, the character a is
both lowercase and alphabetic, as well as a hexadecimal digit. In bash prior
to version 4.0, there was no way for case to match more than one test. Modern
versions of bash add the ;;& notation to terminate each action, so now we
can do this:


#!/bin/bash

# case4-2: test a character

read -n 1 -p "Type a character > "
echo
case "$REPLY" in
    [[:upper:]])    echo "'$REPLY' is upper case." ;;&
    [[:lower:]])    echo "'$REPLY' is lower case." ;;&
    [[:alpha:]])    echo "'$REPLY' is alphabetic." ;;&
    [[:digit:]])    echo "'$REPLY' is a digit." ;;&
    [[:graph:]])    echo "'$REPLY' is a visible character." ;;&
    [[:punct:]])    echo "'$REPLY' is a punctuation symbol." ;;&
    [[:space:]])    echo "'$REPLY' is a whitespace character." ;;&
    [[:xdigit:]])   echo "'$REPLY' is a hexadecimal digit." ;;&
esac


When we run this script, we get this: 


[me@linuxbox ~]$ case4-2
Type a character > a
'a' is lower case.
'a' is alphabetic.
'a' is a visible character.
'a' is a hexadecimal digit.


The addition of the ;;& syntax allows case to continue to the next test 
rather than simply terminating. 
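
A non-interactive sketch of the same mechanism may make the behavior easier to experiment with. The classify function here is hypothetical and checks only a few of the character classes; it assumes bash 4.0 or later, since earlier versions do not accept ;;&.

```shell
#!/bin/bash

# classify: hypothetical example of ;;& (requires bash 4.0 or later)

classify () {
    case "$1" in
        [[:lower:]])   echo "lower" ;;&
        [[:alpha:]])   echo "alpha" ;;&
        [[:digit:]])   echo "digit" ;;&
        [[:xdigit:]])  echo "xdigit" ;;
    esac
}

classify a    # matches the lower, alpha, and xdigit patterns
classify 7    # matches the digit and xdigit patterns
classify g    # matches the lower and alpha patterns only
```

Each clause ending in ;;& tells case to keep testing the remaining patterns, so a single character can trigger several actions.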


Summing Up

The case command is a handy addition to our bag of programming tricks.
As we will see in the next chapter, it’s the perfect tool for handling certain
types of problems.


POSITIONAL PARAMETERS


One feature that has been missing from our programs so far is the ability to
accept and process command line options and arguments. In this chapter, we
will examine the shell features that allow our programs to get access to the
contents of the command line.


Accessing the Command Line 


The shell provides a set of variables called positional parameters that contain 
the individual words on the command line. The variables are named 0 
through 9. They can be demonstrated this way: 


#!/bin/bash

# posit-param: script to view command line parameters

echo "
\$0 = $0
\$1 = $1
\$2 = $2
\$3 = $3
\$4 = $4
\$5 = $5
\$6 = $6
\$7 = $7
\$8 = $8
\$9 = $9
"


This is a simple script that displays the values of the variables $0 through
$9. When executed with no command line arguments, the result is this:


[me@linuxbox ~]$ posit-param

$0 = /home/me/bin/posit-param
$1 =
$2 =
$3 =
$4 =
$5 =
$6 =
$7 =
$8 =
$9 =


Even when no arguments are provided, $0 will always contain the first 
item appearing on the command line, which is the pathname of the pro- 
gram being executed. When arguments are provided, we see these results: 


[me@linuxbox ~]$ posit-param a b c d

$0 = /home/me/bin/posit-param
$1 = a
$2 = b
$3 = c
$4 = d
$5 =
$6 =
$7 =
$8 =
$9 =


You can actually access more than nine parameters using parameter expansion. To 
specify a number greater than nine, surround the number in braces, as in ${10}, 
${55}, ${211}, and so on. 
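
The braces matter because, without them, the shell reads $10 as $1 followed by a literal 0. A short sketch (show_tenth is a hypothetical function name) makes the difference visible:

```shell
#!/bin/bash

# show-tenth: hypothetical example comparing $10 with ${10}

show_tenth () {
    echo "without braces: $10"     # expands $1, then appends a literal 0
    echo "with braces:    ${10}"   # expands the tenth positional parameter
}

show_tenth a b c d e f g h i j
```

With the arguments shown, the first line ends in a0 and the second ends in j.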


Determining the Number of Arguments 


The shell also provides a variable, $#, that contains the number of argu- 
ments on the command line. 


#!/bin/bash

# posit-param: script to view command line parameters

echo "
Number of arguments: $#
\$0 = $0
\$1 = $1
\$2 = $2
\$3 = $3
\$4 = $4
\$5 = $5
\$6 = $6
\$7 = $7
\$8 = $8
\$9 = $9
"


This is the result: 


[me@linuxbox ~]$ posit-param a b c d

Number of arguments: 4
$0 = /home/me/bin/posit-param
$1 = a
$2 = b
$3 = c
$4 = d
$5 =
$6 =
$7 =
$8 =
$9 =


shift—Getting Access to Many Arguments 


But what happens when we give the program a large number of arguments 
such as the following? 


[me@linuxbox ~]$ posit-param *

Number of arguments: 82
$0 = /home/me/bin/posit-param
$1 = addresses.ldif
$2 = bin
$3 = bookmarks.html
$4 = debian-500-i386-netinst.iso
$5 = debian-500-i386-netinst.jigdo
$6 = debian-500-i386-netinst.template
$7 = debian-cd_info.tar.gz
$8 = Desktop
$9 = dirlist-bin.txt


On this example system, the wildcard * expands into 82 arguments. How
can we process that many? The shell provides a method, albeit a clumsy one,
to do this. The shift command causes all the parameters to “move down one”
each time it is executed. In fact, by using shift, it is possible to get by with
only one parameter (in addition to $0, which never changes).


#!/bin/bash

# posit-param2: script to display all arguments

count=1

while [[ $# -gt 0 ]]; do
    echo "Argument $count = $1"
    count=$((count + 1))
    shift
done


Each time shift is executed, the value of $2 is moved to $1, the value of 
$3 is moved to $2, and so on. The value of $# is also reduced by one. 
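
The shift builtin also accepts an optional count, so shift 2 moves everything down by two places at once. As a sketch, the hypothetical print_pairs function below uses this to walk through its arguments two at a time:

```shell
#!/bin/bash

# print-pairs: hypothetical example of shift with a count

print_pairs () {
    while [[ $# -ge 2 ]]; do
        echo "$1 => $2"
        shift 2     # discard both words of the pair in one step
    done
}

print_pairs name linuxbox user me
```

The loop stops as soon as fewer than two arguments remain, so a trailing unpaired word is silently ignored in this sketch.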

In the posit-param2 program, we create a loop that evaluates the number 
of arguments remaining and continues as long as there is at least one. We 
display the current argument, increment the variable count with each itera- 
tion of the loop to provide a running count of the number of arguments 
processed, and, finally, execute a shift to load $1 with the next argument. 
Here is the program at work: 


[me@linuxbox ~]$ posit-param2 a b c d
Argument 1 = a
Argument 2 = b
Argument 3 = c
Argument 4 = d

Simple Applications 


Even without shift, it’s possible to write useful applications using positional 
parameters. By way of example, here is a simple file information program: 


#!/bin/bash

# file-info: simple file information program

PROGNAME="$(basename "$0")"

if [[ -e "$1" ]]; then
    echo -e "\nFile Type:"
    file "$1"
    echo -e "\nFile Status:"
    stat "$1"
else
    echo "$PROGNAME: usage: $PROGNAME file" >&2
    exit 1
fi


This program displays the file type (determined by the file command) 
and the file status (from the stat command) of a specified file. One interest- 
ing feature of this program is the PROGNAME variable. It is given the value that 
results from the basename "$0" command. The basename command removes the 
leading portion of a pathname, leaving only the base name of a file. In our 
example, basename removes the leading portion of the pathname contained 
in the $0 parameter, the full pathname of our example program. This value 
is useful when constructing messages such as the usage message at the end 
of the program. By coding it this way, the script can be renamed, and the 
message automatically adjusts to contain the name of the program. 


Using Positional Parameters with Shell Functions 


Just as positional parameters are used to pass arguments to shell scripts,
they can also be used to pass arguments to shell functions. To demonstrate,
we will convert the file-info script into a shell function named file_info.


file_info () {

    # file_info: function to display file information

    if [[ -e "$1" ]]; then
        echo -e "\nFile Type:"
        file "$1"
        echo -e "\nFile Status:"
        stat "$1"
    else
        echo "$FUNCNAME: usage: $FUNCNAME file" >&2
        return 1
    fi
}


Now, if a script that incorporates the file_info shell function calls the 
function with a filename argument, the argument will be passed to the 
function. 

With this capability, we can write many useful shell functions that not 
only can be used in scripts but also can be used within our .bashrc files. 

Notice that the PROGNAME variable was changed to the shell variable 
FUNCNAME. The shell automatically updates this variable to keep track of the 
currently executed shell function. Note that $0 always contains the full path- 
name of the first item on the command line (i.e., the name of the program) 
and does not contain the name of the shell function as we might expect. 
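
The difference between the two can be seen with a small sketch. The function name where_am_i is hypothetical; what $0 prints depends on how the enclosing script was started, so only the first line of output is predictable.

```shell
#!/bin/bash

# where-am-i: hypothetical example contrasting FUNCNAME with $0

where_am_i () {
    echo "function: $FUNCNAME"
    echo "program:  $(basename "$0")"   # still the script (or shell), not the function
}

where_am_i
```

Inside the function, FUNCNAME reports where_am_i, while $0 continues to name the program that was launched from the command line.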


Handling Positional Parameters en Masse 


It is sometimes useful to manage all the positional parameters as a group.
For example, we might want to write a “wrapper” around another program.
This means we create a script or shell function that simplifies the invocation
of another program. The wrapper, in this case, supplies a list of arcane
command line options and then passes a list of arguments to the lower-level
program.

The shell provides two special parameters for this purpose. They both 
expand into the complete list of positional parameters but differ in rather 
subtle ways. Table 32-1 describes these parameters. 


Table 32-1: The * and @ Special Parameters

Parameter    Description
$*           Expands into the list of positional parameters, starting with 1.
             When surrounded by double quotes, it expands into a double-quoted
             string containing all of the positional parameters, each separated
             by the first character of the IFS shell variable (by default a
             space character).
$@           Expands into the list of positional parameters, starting with 1.
             When surrounded by double quotes, it expands each positional
             parameter into a separate word as if it was surrounded by double
             quotes.


Here is a script that shows these special parameters in action: 


#!/bin/bash

# posit-params3: script to demonstrate $* and $@

print_params () {
    echo "\$1 = $1"
    echo "\$2 = $2"
    echo "\$3 = $3"
    echo "\$4 = $4"
}

pass_params () {
    echo -e "\n" '$* :';   print_params $*
    echo -e "\n" '"$*" :'; print_params "$*"
    echo -e "\n" '$@ :';   print_params $@
    echo -e "\n" '"$@" :'; print_params "$@"
}

pass_params "word" "words with spaces"


In this rather convoluted program, we create two arguments, called 
word and words with spaces, and pass them to the pass_params function. That 
function, in turn, passes them on to the print_params function, using each 
of the four methods available with the special parameters $* and $@. When 
executed, the script reveals the differences. 


[me@linuxbox ~]$ posit-param3

 $* :
$1 = word
$2 = words
$3 = with
$4 = spaces

 "$*" :
$1 = word words with spaces
$2 =
$3 =
$4 =

 $@ :
$1 = word
$2 = words
$3 = with
$4 = spaces

 "$@" :
$1 = word
$2 = words with spaces
$3 =
$4 =


With our arguments, both $* and $@ produce a four-word result. 


word words with spaces 


"$*" produces a one-word result. 


"word words with spaces" 


"$@" produces a two-word result. 


"word" "words with spaces" 


This matches our actual intent. The lesson to take from this is that 
even though the shell provides four different ways of getting the list of 
positional parameters, "$@" is by far the most useful for most situations 
because it preserves the integrity of each positional parameter. To ensure 
safety, it should always be used, unless we have a compelling reason not 
to use it. 
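
For a wrapper, the practical consequence is easy to check by counting the arguments that arrive at the other end. In this sketch, all of the function names are hypothetical; join_by_comma is included only to show the one case where "$*" is the better choice, namely joining the arguments with the first character of IFS.

```shell
#!/bin/bash

# wrapper-demo: hypothetical example of passing arguments through a wrapper

count_args () {
    echo "$#"
}

wrap_unquoted () {
    count_args $@       # arguments are split again on whitespace
}

wrap_quoted () {
    count_args "$@"     # each argument is passed through intact
}

join_by_comma () {
    local IFS=,         # "$*" joins with the first character of IFS
    echo "$*"
}

wrap_unquoted "word" "words with spaces"
wrap_quoted   "word" "words with spaces"
join_by_comma a b c
```

The unquoted wrapper delivers four arguments, the quoted wrapper delivers the original two, and join_by_comma produces a,b,c.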


A More Complete Application 


After a long hiatus, we are going to resume work on our sys_info_page pro- 
gram, last seen in Chapter 27. Our next addition will add several command 
line options to the program as follows: 


Output file         We will add an option to specify a name for a file to
                    contain the program’s output. It will be specified as
                    either -f file or --file file.

Interactive mode    This option will prompt the user for an output filename
                    and will determine whether the specified file already
                    exists. If it does, the user will be prompted before the
                    existing file is overwritten. This option will be specified
                    by either -i or --interactive.

Help                Either -h or --help may be specified to cause the program
                    to output an informative usage message.


Here is the code needed to implement the command line processing: 


usage () {
    echo "$PROGNAME: usage: $PROGNAME [-f file | -i]"
    return
}

# process command line options

interactive=
filename=

while [[ -n "$1" ]]; do
    case "$1" in
        -f | --file)          shift
                              filename="$1"
                              ;;
        -i | --interactive)   interactive=1
                              ;;
        -h | --help)          usage
                              exit
                              ;;
        *)                    usage >&2
                              exit 1
                              ;;
    esac
    shift
done


First, we add a shell function called usage to display a message when the 
help option is invoked or an unknown option is attempted. 

Next, we begin the processing loop. This loop continues while the 
positional parameter $1 is not empty. At the end of the loop, we have a shift 
command to advance the positional parameters to ensure that the loop will 
eventually terminate. 

Within the loop, we have a case statement that examines the current posi- 
tional parameter to see whether it matches any of the supported choices. 
If a supported parameter is found, it is acted upon. If an unknown choice 
is found, the usage message is displayed, and the script terminates with an 
error. 

The -f parameter is handled in an interesting way. When detected, it 
causes an additional shift to occur, which advances the positional parameter 
$1 to the filename argument supplied to the -f option. 
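
One case the loop above does not address is -f given as the last word on the command line, with no filename after it. The following is only a sketch of one way to guard against that, written as a self-contained hypothetical function (parse_file_option) rather than as a change to sys_info_page itself:

```shell
#!/bin/bash

# parse-file-option: hypothetical sketch of checking that -f has an argument

parse_file_option () {
    local filename=
    while [[ -n "$1" ]]; do
        case "$1" in
            -f | --file)
                if [[ -z "$2" ]]; then
                    echo "option $1 requires a filename" >&2
                    return 1
                fi
                shift               # step past the option itself
                filename="$1"       # $1 is now the filename argument
                ;;
            *)
                echo "unknown option: $1" >&2
                return 1
                ;;
        esac
        shift
    done
    echo "$filename"
}

parse_file_option -f report.html
```

The check looks ahead at $2 before the extra shift, so the option is rejected before its missing argument can be mistaken for an empty filename.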

We next add the code to implement the interactive mode. 


# interactive mode

if [[ -n "$interactive" ]]; then
    while true; do
        read -p "Enter name of output file: " filename
        if [[ -e "$filename" ]]; then
            read -p "'$filename' exists. Overwrite? [y/n/q] > "
            case "$REPLY" in
                Y|y)    break
                        ;;
                Q|q)    echo "Program terminated."
                        exit
                        ;;
                *)      continue
                        ;;
            esac
        elif [[ -z "$filename" ]]; then
            continue
        else
            break
        fi
    done
fi


If the interactive variable is not empty, an endless loop is started, which 
contains the filename prompt and subsequent existing file-handling code. 
If the desired output file already exists, the user is prompted to overwrite, 
choose another filename, or quit the program. If the user chooses to over- 
write an existing file, a break is executed to terminate the loop. Notice how 
the case statement detects only whether the user chooses to overwrite or quit. 
Any other choice causes the loop to continue and prompts the user again. 

To implement the output filename feature, we must first convert the 
existing page-writing code into a shell function, for reasons that will 
become clear in a moment. 


write_html_page () {
    cat <<- _EOF_
<html>
<head>
<title>$TITLE</title>
</head>
<body>
<h1>$TITLE</h1>
<p>$TIMESTAMP</p>
$(report_uptime)
$(report_disk_space)
$(report_home_space)
</body>
</html>
_EOF_
    return
}

# output html page

if [[ -n "$filename" ]]; then
    if touch "$filename" && [[ -f "$filename" ]]; then
        write_html_page > "$filename"
    else
        echo "$PROGNAME: Cannot write file '$filename'" >&2
        exit 1
    fi
else
    write_html_page
fi


The code that handles the logic of the -f option appears at the end of 
the previous listing. In it, we test for the existence of a filename, and if one 
is found, a test is performed to see whether the file is indeed writable. To 
do this, a touch is performed, followed by a test to determine whether the 
resulting file is a regular file. These two tests take care of situations where 
an invalid pathname is input (touch will fail), and, if the file already exists, 
that it’s a regular file. 
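
The two-part test can be tried out on its own. In this sketch, can_write is a hypothetical helper name, and the temporary directory exists only for the demonstration. Note that touch creates an empty file as a side effect when the file does not yet exist.

```shell
#!/bin/bash

# can-write: hypothetical helper wrapping the touch and -f tests

can_write () {
    touch "$1" 2>/dev/null && [[ -f "$1" ]]
}

demo_dir="$(mktemp -d)"

can_write "$demo_dir/page.html"        && echo "regular file: ok"
can_write "$demo_dir/nowhere/page.html" || echo "bad pathname: rejected"
can_write "$demo_dir"                   || echo "directory: rejected"

rm -r "$demo_dir"
```

The second call fails because touch cannot create a file in a directory that does not exist. The third fails because touch succeeds on a directory but the -f test does not.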

As we can see, the write_html_page function is called to perform the actual 
generation of the page. Its output is either directed to standard output (if the 
variable filename is empty) or redirected to the specified file. Since we have 
two possible destinations for the HTML code, it makes sense to convert the 
write_html_page routine to a shell function to avoid redundant code. 


Summing Up

With the addition of positional parameters, we can now write fairly
functional scripts. For simple, repetitive tasks, positional parameters make
it possible to write very useful shell functions that can be placed in a
user’s .bashrc file.

Our sys_info_page program has grown in complexity and sophistication.
Here is a complete listing, including the most recent changes:


#!/bin/bash

# sys_info_page: program to output a system information page

PROGNAME="$(basename "$0")"
TITLE="System Information Report For $HOSTNAME"
CURRENT_TIME="$(date +"%x %r %Z")"
TIMESTAMP="Generated $CURRENT_TIME, by $USER"

report_uptime () {
    cat <<- _EOF_
<h2>System Uptime</h2>
<pre>$(uptime)</pre>
_EOF_
    return
}

report_disk_space () {
    cat <<- _EOF_
<h2>Disk Space Utilization</h2>
<pre>$(df -h)</pre>
_EOF_
    return
}

report_home_space () {
    if [[ "$(id -u)" -eq 0 ]]; then
        cat <<- _EOF_
<h2>Home Space Utilization (All Users)</h2>
<pre>$(du -sh /home/*)</pre>
_EOF_
    else
        cat <<- _EOF_
<h2>Home Space Utilization ($USER)</h2>
<pre>$(du -sh "$HOME")</pre>
_EOF_
    fi
    return
}

usage () {
    echo "$PROGNAME: usage: $PROGNAME [-f file | -i]"
    return
}

write_html_page () {
    cat <<- _EOF_
<html>
<head>
<title>$TITLE</title>
</head>
<body>
<h1>$TITLE</h1>
<p>$TIMESTAMP</p>
$(report_uptime)
$(report_disk_space)
$(report_home_space)
</body>
</html>
_EOF_
    return
}

# process command line options

interactive=
filename=

while [[ -n "$1" ]]; do
    case "$1" in
        -f | --file)          shift
                              filename="$1"
                              ;;
        -i | --interactive)   interactive=1
                              ;;
        -h | --help)          usage
                              exit
                              ;;
        *)                    usage >&2
                              exit 1
                              ;;
    esac
    shift
done

# interactive mode

if [[ -n "$interactive" ]]; then
    while true; do
        read -p "Enter name of output file: " filename
        if [[ -e "$filename" ]]; then
            read -p "'$filename' exists. Overwrite? [y/n/q] > "
            case "$REPLY" in
                Y|y)    break
                        ;;
                Q|q)    echo "Program terminated."
                        exit
                        ;;
                *)      continue
                        ;;
            esac
        elif [[ -z "$filename" ]]; then
            continue
        else
            break
        fi
    done
fi

# output html page

if [[ -n "$filename" ]]; then
    if touch "$filename" && [[ -f "$filename" ]]; then
        write_html_page > "$filename"
    else
        echo "$PROGNAME: Cannot write file '$filename'" >&2
        exit 1
    fi
else
    write_html_page
fi


We’re not done yet. There are still a few more things we can do and
improvements we can make.


FLOW CONTROL: LOOPING WITH FOR


In this final chapter on flow control, we will look at another of the shell’s
looping constructs. The for loop differs from the while and until loops in
that it provides a means of processing sequences during a loop. This turns
out to be very useful when programming. Accordingly, the for loop is a
popular construct in bash scripting.

A for loop is implemented, naturally enough, with the for compound
command. In bash, for is available in two forms.


for: Traditional Shell Form 


The original for command’s syntax is as follows:


for variable [in words]; do 
commands 
done 


where variable is the name of a variable that takes on a new value on each
pass through the loop, words is an optional list of items that will be
sequentially assigned to variable, and commands are the commands that are to
be executed on each iteration of the loop.

The for command is useful on the command line. We can easily dem- 
onstrate how it works. 


[me@linuxbox ~]$ for i in A B C D; do echo $i; done
A
B
C
D


In this example, for is given a list of four words: A, B, C, and D. With 
a list of four words, the loop is executed four times. Each time the loop is 
executed, a word is assigned to the variable i. Inside the loop, we have an 
echo command that displays the value of i to show the assignment. As with 
the while and until loops, the done keyword closes the loop. 

The really powerful feature of for is the number of interesting ways 
we can create the list of words. For example, we can do it through brace 
expansion, like so: 


[me@linuxbox ~]$ for i in {A..D}; do echo $i; done
A
B
C
D


or we could use a pathname expansion, as follows: 


[me@linuxbox ~]$ for i in distros*.txt; do echo "$i"; done
distros-by-date.txt
distros-dates.txt
distros-key-names.txt
distros-key-vernums.txt
distros-names.txt
distros.txt
distros-vernums.txt
distros-versions.txt


Pathname expansion provides a nice, clean list of pathnames that can 
be processed in the loop. The one precaution needed is to check that the 
expansion actually matched something. By default, if the expansion fails 
to match any files, the wildcards themselves (distros*.txt in the preceding 
example) will be returned. To guard against this, we would code the pre- 
ceding example in a script this way: 


for i in distros*.txt; do
    if [[ -e "$i" ]]; then
        echo "$i"
    fi
done


By adding a test for file existence, we will ignore a failed expansion. 
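
An alternative worth knowing about is the nullglob shell option, which makes an unmatched pattern expand to nothing at all, so the loop body simply never runs. The sketch below wraps this in a hypothetical function, list_matches, that takes the directory to search as its argument.

```shell
#!/bin/bash

# list-matches: hypothetical example using the nullglob shell option

list_matches () {
    local dir="$1" i
    shopt -s nullglob       # unmatched patterns now expand to nothing
    for i in "$dir"/distros*.txt; do
        echo "$i"
    done
    shopt -u nullglob       # turn the option back off
}

list_matches .
```

Since nullglob changes how every later pattern in the script expands, this sketch turns it off again as soon as the loop is finished.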
Another common method of word production is command substitution. 


#!/bin/bash

# longest-word: find longest string in a file

while [[ -n "$1" ]]; do
    if [[ -r "$1" ]]; then
        max_word=
        max_len=0
        for i in $(strings "$1"); do
            len="$(echo -n "$i" | wc -c)"
            if (( len > max_len )); then
                max_len="$len"
                max_word="$i"
            fi
        done
        echo "$1: '$max_word' ($max_len characters)"
    fi
    shift
done


In this example, we look for the longest string found within a file. When 
given one or more filenames on the command line, this program uses the 
strings program (which is included in the GNU binutils package) to gener- 
ate a list of readable text “words” in each file. The for loop processes each 
word in turn and determines whether the current word is the longest found 
so far. When the loop concludes, the longest word is displayed. 

One thing to note here is that, contrary to our usual practice, we do 
not surround the command substitution $(strings "$1") with double quotes. 
This is because we actually want word splitting to occur to give us our list. 
If we had surrounded the command substitution with quotes, it would pro- 
duce only a single word containing every string in the file. That’s not exactly 
what we are looking for. 
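
The effect of the quotes is easy to confirm by counting words. In this sketch, text_source is a hypothetical stand-in for the strings command, and count_words simply reports how many arguments it received.

```shell
#!/bin/bash

# split-demo: hypothetical example of word splitting in command substitution

count_words () {
    echo "$#"
}

text_source () {
    printf 'alpha\nbeta gamma\n'
}

count_words $(text_source)      # unquoted: split into three words
count_words "$(text_source)"    # quoted: one word containing everything
```

The unquoted form yields three separate words, which is what a for loop needs. The quoted form yields a single word with the spaces and newline still inside it.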

If the optional in words portion of the for command is omitted, for 
defaults to processing the positional parameters. We will modify our 
longest-word script to use this method: 


#!/bin/bash

# longest-word2: find longest string in a file

for i; do
    if [[ -r "$i" ]]; then
        max_word=
        max_len=0
        for j in $(strings "$i"); do
            len="$(echo -n "$j" | wc -c)"
            if (( len > max_len )); then
                max_len="$len"
                max_word="$j"
            fi
        done
        echo "$i: '$max_word' ($max_len characters)"
    fi
done


As we can see, we have changed the outermost loop to use for in place 
of while. By omitting the list of words in the for command, the positional 
parameters are used instead. Inside the loop, previous instances of the vari- 
able i have been changed to the variable j. The use of shift has also been 
eliminated. 
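
The same shorthand works inside a shell function, where the positional parameters are the function's own arguments. A minimal sketch, using the hypothetical function name print_args:

```shell
#!/bin/bash

# print-args: hypothetical example of for without the "in words" portion

print_args () {
    local arg
    for arg; do             # behaves like: for arg in "$@"; do
        echo "[$arg]"
    done
}

print_args "one" "two words" three
```

Each argument is handed to the loop intact, so an argument containing spaces still arrives as a single item.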


WHY I?

You may have noticed that the variable i was chosen for each of the previous
for loop examples. Why? No specific reason actually besides tradition. The
variable used with for can be any valid variable, but i is the most common,
followed by j and k.

The basis of this tradition comes from the Fortran programming language.
In Fortran, undeclared variables starting with the letters I, J, K, L, M,
and N are automatically typed as integers, while variables beginning with any
other letter are typed as reals (numbers with decimal fractions). This
behavior led programmers to use the variables I, J, and K for loop variables
since it was less work to use them when a temporary variable (as loop
variables often are) was needed.

It also led to the following Fortran-based witticism: “GOD is real, unless
declared integer.”


for: C Language Form


Recent versions of bash have added a second form of the for command syntax,
one that resembles the form found in the C programming language.
Many other languages support this form, as well.


for (( expression1; expression2; expression3 )); do 
commands 
done 


Here, expression1, expression2, and expression3 are arithmetic expres- 
sions, and commands are the commands to be performed during each itera- 
tion of the loop. 

In terms of behavior, this form is equivalent to the following construct. 


(( expression1 )) 
while (( expression2 )); do 
commands 
(( expression3 )) 
done 


expression1 is used to initialize conditions for the loop, expression2 is 
used to determine when the loop is finished, and expression3 is carried out 
at the end of each iteration of the loop. 
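
To see the equivalence in practice, here is a sketch of the same count written both ways. The function names are hypothetical, and the count starts at 1 purely as a choice for this example.

```shell
#!/bin/bash

# for-vs-while: hypothetical example of the two equivalent loop forms

count_with_for () {
    local i
    for (( i = 1; i <= 3; i = i + 1 )); do
        echo "$i"
    done
}

count_with_while () {
    local i
    (( i = 1 ))                 # expression1: initialize
    while (( i <= 3 )); do      # expression2: test before each pass
        echo "$i"
        (( i = i + 1 ))         # expression3: runs at the end of each pass
    done
}

count_with_for
count_with_while
```

Both functions print the numbers 1 through 3. The for form simply gathers the three expressions into one place.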

Here is a typical application: 


#!/bin/bash

# simple_counter: demo of C style for command

for (( i=0; i<5; i=i+1 )); do
    echo $i
done


When executed, it produces the following output: 


[me@linuxbox ~]$ simple_counter
0
1
2
3
4


In this example, expression1 initializes the variable i with the value of 
zero, expression2 allows the loop to continue as long as the value of i remains 
less than 5, and expression3 increments the value of i by 1 each time the loop 
repeats. 

The C language form of for is useful anytime a numeric sequence is 
needed. We will see several applications for this in the next two chapters. 


Summing Up 


With our knowledge of the for command, we will now apply the final 
improvements to our sys_info_page script. Currently, the report_home_space 
function looks like this: 


report_home_space () {
    if [[ "$(id -u)" -eq 0 ]]; then
        cat <<- _EOF_
<h2>Home Space Utilization (All Users)</h2>
<pre>$(du -sh /home/*)</pre>
_EOF_
    else
        cat <<- _EOF_
<h2>Home Space Utilization ($USER)</h2>
<pre>$(du -sh "$HOME")</pre>
_EOF_
    fi
    return
}


Next, we will rewrite it to provide more detail for each user’s home 
directory and include the total number of files and subdirectories in each. 


report_home_space () {

    local format="%8s%10s%10s\n"
    local i dir_list total_files total_dirs total_size user_name

    if [[ "$(id -u)" -eq 0 ]]; then
        dir_list=/home/*
        user_name="All Users"
    else
        dir_list="$HOME"
        user_name="$USER"
    fi

    echo "<h2>Home Space Utilization ($user_name)</h2>"

    for i in $dir_list; do

        total_files="$(find "$i" -type f | wc -l)"
        total_dirs="$(find "$i" -type d | wc -l)"
        total_size="$(du -sh "$i" | cut -f 1)"

        echo "<h3>$i</h3>"
        echo "<pre>"
        printf "$format" "Dirs" "Files" "Size"
        printf "$format" "----" "-----" "----"
        printf "$format" "$total_dirs" "$total_files" "$total_size"
        echo "</pre>"
    done
    return
}


This rewrite applies much of what we have learned so far. We still test
for the superuser, but instead of performing the complete set of actions as
part of the if, we set some variables used later in a for loop. We have added
several local variables to the function and made use of printf to format
some of the output.


STRINGS AND NUMBERS


Computer programs are all about working with data. In past chapters, we have
focused on processing data at the file level. However, many programming
problems need to be solved using smaller units of data such as strings and
numbers.


In this chapter, we will look at several shell features that are used to 
manipulate strings and numbers. The shell provides a variety of param- 
eter expansions that perform string operations. In addition to arithmetic 
expansion (which we touched upon in Chapter 7), there is a well-known 
command line program called bc, which performs higher-level math. 


Parameter Expansion 


Though parameter expansion came up in Chapter 7, we did not cover it in 
detail because most parameter expansions are used in scripts rather than 
on the command line. We have already worked with some forms of param- 
eter expansion, for example, shell variables. The shell provides many more. 




Chapter 34 


It’s always good practice to enclose parameter expansions in double quotes to prevent 
unwanted word splitting, unless there is a specific reason not to. This is especially 
true when dealing with filenames since they can often include embedded spaces and 
other assorted nastiness. 


Basic Parameters 


The simplest form of parameter expansion is reflected in the ordinary use 
of variables. Here’s an example: 


$a 


When expanded, this becomes whatever the variable a contains. Simple 
parameters may also be surrounded by braces. 


${a} 


This has no effect on the expansion but is required if the variable is 
adjacent to other text, which may confuse the shell. In this example, we 
attempt to create a filename by appending the string _file to the contents 
of the variable a: 


[me@linuxbox ~]$ a="foo" 
[me@linuxbox ~]$ echo "$a_file" 


If we perform this sequence of commands, the result will be nothing 
because the shell will try to expand a variable named a_file rather than a. 
This problem can be solved by adding braces around the “real” variable 
name. 


[me@linuxbox ~]$ echo "${a}_file"
foo_file 


We have also seen that positional parameters greater than nine can be 
accessed by surrounding the number in braces. For example, to access the 
eleventh positional parameter, we can do this: 


${11} 
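
As a minimal sketch of both uses of braces, the two functions below (join_name and eleventh are hypothetical names chosen for illustration) build a filename from a variable and read the eleventh positional parameter:

```shell
#!/bin/bash
# Sketch: braces separate the parameter name from adjacent text.
# join_name and eleventh are hypothetical helper names.

join_name () {
    local a="$1"
    # Without braces the shell would look for a variable named a_file.
    echo "${a}_file"
}

eleventh () {
    # $11 would be read as ${1} followed by a literal 1.
    echo "${11}"
}

join_name foo
eleventh a b c d e f g h i j k l
```

The first call prints foo_file, and the second prints k, the eleventh argument.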


Expansions to Manage Empty Variables 


Several parameter expansions are intended to deal with nonexistent and 
empty variables. These expansions are handy for handling missing posi- 
tional parameters and assigning default values to parameters. Here is one 
such expansion: 


${parameter:-word}


If parameter is unset (i.e., does not exist) or is empty, this expansion 
results in the value of word. If parameter is not empty, the expansion results 
in the value of parameter. 


[me@linuxbox ~]$ foo= 

[me@linuxbox ~]$ echo ${foo:-"substitute value if unset"} 
substitute value if unset 

[me@linuxbox ~]$ echo $foo 


[me@linuxbox ~]$ foo=bar 

[me@linuxbox ~]$ echo ${foo:-"substitute value if unset"} 
bar 

[me@linuxbox ~]$ echo $foo 

bar 
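
A common use of this form is to supply a default for a missing positional parameter. The following is only a sketch; greet is a hypothetical function name:

```shell
#!/bin/bash
# Sketch: fall back to a default when an argument is missing or empty.
# greet is a hypothetical function name.

greet () {
    local name="${1:-world}"   # use "world" if $1 is unset or empty
    echo "Hello, ${name}."
}

greet
greet Linux
```

Because the expansion does not assign anything, the caller's arguments are left untouched; only the local copy receives the default.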


Here is another expansion, in which we use the equal sign instead of 
a dash: 


${parameter:=word}


If parameter is unset or empty, this expansion results in the value of word. 
In addition, the value of word is assigned to parameter. If parameter is not empty, 
the expansion results in the value of parameter. 


[me@linuxbox ~]$ foo= 

[me@linuxbox ~]$ echo ${foo:="default value if unset"} 
default value if unset 

[me@linuxbox ~]$ echo $foo 

default value if unset 

[me@linuxbox ~]$ foo=bar 

[me@linuxbox ~]$ echo ${foo:="default value if unset"} 
bar 

[me@linuxbox ~]$ echo $foo 

bar 


Positional and other special parameters cannot be assigned this way. 
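
In scripts, this form is often paired with the : (null) command, which evaluates its arguments and does nothing else, so that the assignment happens without printing anything. Here is a sketch of that idiom; REPORT_DIR and REPORT_TITLE are hypothetical variable names:

```shell
#!/bin/bash
# Sketch: assign defaults to configuration variables in place.
# REPORT_DIR and REPORT_TITLE are hypothetical variable names.

REPORT_TITLE="Nightly Report"   # already set, so it is left alone
unset REPORT_DIR                # unset, so the default is assigned

# The : command discards its arguments; only the side effect of the
# := expansion (the assignment) remains.
: "${REPORT_DIR:=/tmp/reports}"
: "${REPORT_TITLE:=Untitled}"

echo "$REPORT_DIR"
echo "$REPORT_TITLE"
```

After these lines run, REPORT_DIR holds the default while REPORT_TITLE keeps the value it already had.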


Here we use a question mark: 


${parameter:?word}


If parameter is unset or empty, this expansion causes the script to exit with 
an error, and the contents of word are sent to standard error. If parameter is not 
empty, the expansion results in the value of parameter. 


[me@linuxbox ~]$ foo= 

[me@linuxbox ~]$ echo ${foo:?"parameter is empty"} 
bash: foo: parameter is empty 

[me@linuxbox ~]$ echo $? 

1 
[me@linuxbox ~]$ foo=bar

[me@linuxbox ~]$ echo ${foo:?"parameter is empty"}
bar

[me@linuxbox ~]$ echo $?
0

Here we use a plus sign: 


${parameter:+word}


If parameter is unset or empty, the expansion results in nothing. If 
parameter is not empty, the value of word is substituted for parameter; how- 
ever, the value of parameter is not changed. 


[me@linuxbox ~]$ foo= 
[me@linuxbox ~]$ echo ${foo:+"substitute value if set"} 


[me@linuxbox ~]$ foo=bar 
[me@linuxbox ~]$ echo ${foo:+"substitute value if set"} 
substitute value if set 
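
To see where the question mark and plus sign forms might be useful, consider the following sketch. It assumes two hypothetical functions: build_cmd adds an option to a command line only when a setting is present, and require_dir refuses to continue without an argument.

```shell
#!/bin/bash
# Sketch: :+ adds an option only when a variable is set, and :? stops
# with an error when a required value is missing.
# build_cmd and require_dir are hypothetical function names.

build_cmd () {
    local pattern="$1" ignore_case="$2"
    # ${ignore_case:+-i} expands to -i only if ignore_case is non-empty.
    echo grep ${ignore_case:+-i} "$pattern"
}

require_dir () (
    # The function body is a subshell, so a failed :? ends only the
    # subshell rather than the whole script.
    dir="${1:?directory name required}"
    echo "using $dir"
)

build_cmd foo ""
build_cmd foo yes
require_dir /tmp
require_dir 2>/dev/null || echo "require_dir failed"
```

Writing the body of require_dir in parentheses is a design choice for this demonstration. In a real script, we would usually want the error to stop the script itself.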


Expansions That Return Variable Names 


The shell has the ability to return the names of variables. This is used in 
some rather exotic situations. 


${!prefix*}
${!prefix@} 


This expansion returns the names of existing variables with names 
beginning with prefix. According to the bash documentation, both forms 
of the expansion perform identically. Here, we list all the variables in the 
environment with names that begin with BASH: 


[me@linuxbox ~]$ echo ${!BASH*} 
BASH BASH_ARGC BASH_ARGV BASH_COMMAND BASH_COMPLETION BASH_COMPLETION_DIR
BASH_LINENO BASH_SOURCE BASH_SUBSHELL BASH_VERSINFO BASH_VERSION
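
One situation where this is handy is printing a group of related settings. The sketch below assumes two hypothetical variables sharing the prefix MYAPP_ and combines the expansion with indirect expansion (${!name}), another bash feature, to look up each value:

```shell
#!/bin/bash
# Sketch: list a group of related variables by their common prefix.
# The MYAPP_ variables are hypothetical.

MYAPP_HOST=localhost
MYAPP_PORT=8080

for name in ${!MYAPP_*}; do
    # ${!name} is indirect expansion: the value of the variable whose
    # name is stored in name.
    echo "$name=${!name}"
done
```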


String Operations 


There is a large set of expansions that can be used to operate on strings. 
Many of these expansions are particularly well suited for operations on 
pathnames. The following expansion: 


${#parameter} 


expands into the length of the string contained by parameter. Normally, 
parameter is a string; however, if parameter is either @ or *, then the expansion 
results in the number of positional parameters. 


[me@linuxbox ~]$ foo="This string is long." 
[me@linuxbox ~]$ echo "'$foo' is ${#foo} characters long." 
'This string is long.' is 20 characters long.
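
Both uses of the length expansion can be sketched in a script as follows. The function names long_enough and count_args are hypothetical:

```shell
#!/bin/bash
# Sketch: ${#parameter} for strings and for positional parameters.
# long_enough and count_args are hypothetical function names.

long_enough () {
    local s="$1"
    (( ${#s} >= 8 ))     # succeeds if s has at least 8 characters
}

count_args () {
    echo "${#@}"         # number of positional parameters
}

long_enough "secret" || echo "too short"
count_args a "b c" d
```

Note that count_args reports 3 here, since the quoted "b c" is a single argument.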


The following expansions are used to extract a portion of the string 
contained in parameter: 


${parameter:offset}
${parameter:offset:length}


The extraction begins at offset characters from the beginning of the 
string and continues until the end of the string, unless length is specified. 


[me@linuxbox ~]$ foo="This string is long."
[me@linuxbox ~]$ echo ${foo:5}
string is long.
[me@linuxbox ~]$ echo ${foo:5:6}
string


If the value of offset is negative, it is taken to mean it starts from the end 
of the string rather than the beginning. Note that negative values must be 
preceded by a space to prevent confusion with the ${parameter:-word}
expansion. length, if present, must not be less than zero.

If parameter is @, the result of the expansion is length positional param- 
eters, starting at offset. 


[me@linuxbox ~]$ foo="This string is long." 
[me@linuxbox ~]$ echo ${foo: -5} 

long. 

[me@linuxbox ~]$ echo ${foo: -5:2} 

lo 
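
Substring extraction works well on fixed-width data. As a sketch, assuming a hypothetical timestamp in the form YYYYMMDD-HHMMSS, we can pick out its fields by position:

```shell
#!/bin/bash
# Sketch: split a fixed-width timestamp into fields by position.
# The variable names and the sample value are hypothetical.

stamp="20240517-134502"

year="${stamp:0:4}"
month="${stamp:4:2}"
day="${stamp:6:2}"
time_part="${stamp: -6}"   # note the space before the negative offset

echo "$year-$month-$day $time_part"
```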


The following expansions remove a leading portion of the string con- 
tained in parameter defined by pattern. 


${parameter#pattern} 
${parameter##pattern} 


pattern is a wildcard pattern like those used in pathname expansion. 
The difference in the two forms is that the # form removes the shortest 
match, while the ## form removes the longest match. 


[me@linuxbox ~]$ foo=file.txt.zip
[me@linuxbox ~]$ echo ${foo#*.}
txt.zip
[me@linuxbox ~]$ echo ${foo##*.}
zip




The following are the same as the previous # and ## expansions, except 
they remove text from the end of the string contained in parameter rather 
than from the beginning. 


${parameter%pattern}
${parameter%%pattern}


Here is an example: 


[me@linuxbox ~]$ foo=file.txt.zip 
[me@linuxbox ~]$ echo ${foo%.*}
file.txt
[me@linuxbox ~]$ echo ${foo%%.*}
file 
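
The four pattern-removal forms are often combined to take a pathname apart. The following sketch uses a hypothetical sample path and does the work that would otherwise need the external basename and dirname programs:

```shell
#!/bin/bash
# Sketch: take a pathname apart using only parameter expansion.
# The sample path is hypothetical.

path="/home/me/archive/file.txt.zip"

dir="${path%/*}"      # remove shortest trailing /*  -> directory
file="${path##*/}"    # remove longest leading */    -> filename
ext="${file##*.}"     # remove longest leading *.    -> last extension
stem="${file%%.*}"    # remove longest trailing .*   -> name before first dot

echo "$dir" "$file" "$ext" "$stem"
```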


The following expansions perform a search-and-replace operation 
upon the contents of parameter: 


${parameter/pattern/string}
${parameter//pattern/string}
${parameter/#pattern/string}
${parameter/%pattern/string}


If text is found matching wildcard pattern, it is replaced with the con- 
tents of string. In the normal form, only the first occurrence of pattern 
is replaced. In the // form, all occurrences are replaced. The /# form 
requires that the match occur at the beginning of the string, and the /% 
form requires the match to occur at the end of the string. In every form, 
/string may be omitted, causing the text matched by pattern to be deleted. 


[me@linuxbox ~]$ foo=JPG.JPG
[me@linuxbox ~]$ echo ${foo/JPG/jpg}
jpg.JPG
[me@linuxbox ~]$ echo ${foo//JPG/jpg}
jpg.jpg
[me@linuxbox ~]$ echo ${foo/#JPG/jpg}
jpg.JPG
[me@linuxbox ~]$ echo ${foo/%JPG/jpg}
JPG.jpg
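
As a sketch of how these forms might be used together, the following lines tidy a hypothetical filename by replacing its spaces and changing its extension to lowercase:

```shell
#!/bin/bash
# Sketch: clean up a filename using only search-and-replace expansions.
# The sample filename is hypothetical.

name="My Holiday Photo.JPG"

name="${name// /_}"          # replace every space with an underscore
name="${name/%.JPG/.jpg}"    # replace .JPG only at the end of the string

echo "$name"
```

The result is My_Holiday_Photo.jpg, produced without starting sed or tr.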


Parameter expansion is a good thing to know. The string manipula- 
tion expansions can be used as substitutes for other common commands 
such as sed and cut. Expansions can improve the efficiency of scripts 
by eliminating the use of external programs. As an example, we will 
modify the longest-word program discussed in the previous chapter to 
use the parameter expansion ${#j} in place of the command substitution 
$(echo -n $j | wc -c) and its resulting subshell, like so: 


#!/bin/bash

# longest-word3: find longest string in a file

for i; do
    if [[ -r "$i" ]]; then
        max_word=
        max_len=0
        for j in $(strings "$i"); do
            len="${#j}"
            if (( len > max_len )); then
                max_len="$len"
                max_word="$j"
            fi
        done
        echo "$i: '$max_word' ($max_len characters)"
    fi
done


Next, we will compare the efficiency of the two versions by using the 
time command. 


[me@linuxbox ~]$ time longest-word2 dirlist-usr-bin.txt 
dirlist-usr-bin.txt: 'scrollkeeper-get-extended-content-list' (38 characters) 


real 0m3.618s
user 0m1.544s
sys 0m1.768s


[me@linuxbox ~]$ time longest-word3 dirlist-usr-bin.txt 
dirlist-usr-bin.txt: 'scrollkeeper-get-extended-content-list' (38 characters) 


real 0m0.060s
user 0m0.056s
sys 0m0.008s


The original version of the script takes 3.618 seconds to scan the text file, 
while the new version, using parameter expansion, takes only 0.06 seconds, 
which is a significant improvement. 


Case Conversion 


bash has four parameter expansions and two declare command options to 
support the uppercase/lowercase conversion of strings. 

So, what is case conversion good for? Aside from the obvious aesthetic 
value, it has an important role in programming. Let’s consider the case of a 
database lookup. Imagine that a user has entered a string into a data input 
field that we want to look up in a database. It’s possible the user will enter 
the value in all uppercase letters or lowercase letters or a combination of 
both. We certainly don’t want to populate our database with every possible 
permutation of uppercase and lowercase spellings. What to do? 

A common approach to this problem is to normalize the user’s input. 
That is, convert it into a standardized form before we attempt the database 
lookup. We can do this by converting all the characters in the user’s input 
to either lower or uppercase and ensure that the database entries are nor- 
malized the same way. 
The declare command can be used to normalize strings to either uppercase
or lowercase. Using declare, we can force a variable to always contain
the desired format no matter what is assigned to it.


#!/bin/bash
# ul-declare: demonstrate case conversion via declare

declare -u upper
declare -l lower

if [[ $1 ]]; then
    upper="$1"
    lower="$1"
    echo "$upper"
    echo "$lower"
fi


In the preceding script, we use declare to create two variables, upper and 
lower. We assign the value of the first command line argument (positional 
parameter 1) to each of the variables and then display them on the screen. 


[me@linuxbox ~]$ ul-declare aBc 
ABC 
abc 


As we can see, the command line argument (aBc) has been normalized. 
In addition to declare, there are four parameter expansions that per- 
form upper/lowercase conversion, as described in Table 34-1. 


Table 34-1: Case Conversion Parameter Expansions 


Format                   Result

${parameter,,pattern}    Expand the value of parameter into all lowercase. pattern is
                         an optional shell pattern that will limit which characters (for
                         example, [A-F]) will be converted. See the bash man page
                         for a full description of patterns.

${parameter,pattern}     Expand the value of parameter, changing only the first
                         character to lowercase.

${parameter^^pattern}    Expand the value of parameter into all uppercase letters.

${parameter^pattern}     Expand the value of parameter, changing only the first
                         character to uppercase (capitalization).


Here is a script that demonstrates these expansions: 


#!/bin/bash
# ul-param: demonstrate case conversion via parameter expansion

if [[ "$1" ]]; then
    echo "${1,,}"
    echo "${1,}"
    echo "${1^^}"
    echo "${1^}"
fi


Here is the script in action: 


[me@linuxbox ~]$ ul-param aBc 
abc 
aBc 
ABC 
ABc 


Again, we process the first command line argument and output the
four variations supported by the parameter expansions. While this script
uses the first positional parameter, parameter may be any shell variable
or positional parameter.
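
Returning to the idea of normalizing input before a lookup, here is a minimal sketch. The function is_known_color and its list of colors are hypothetical stand-ins for a real database query, and the ,, expansion requires bash version 4.0 or later:

```shell
#!/bin/bash
# Sketch: normalize user input before comparing it with stored values.
# is_known_color and its color list are hypothetical. Requires bash 4.0+.

is_known_color () {
    local input="${1,,}"    # fold the argument to lowercase
    local color
    for color in red green blue; do
        [[ "$input" == "$color" ]] && return 0
    done
    return 1
}

is_known_color "GrEEn" && echo "match"
```

Since the stored values are all lowercase, any mix of case typed by the user will match after the conversion.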


Arithmetic Evaluation and Expansion 


We looked at arithmetic expansion in Chapter 7. It is used to perform vari- 
ous arithmetic operations on integers. Its basic form is as follows: 


$((expression))


where expression is a valid arithmetic expression. 

This is related to the compound command (( )) used for arithmetic
evaluation (truth tests) we encountered in Chapter 27.

In previous chapters, we saw some of the common types of expressions 
and operators. Here, we will look at a more complete list. 


Number Bases 


In Chapter 9, we got a look at octal (base 8) and hexadecimal (base 16)
numbers. In arithmetic expressions, the shell supports integer constants in
bases from 2 through 64. Table 34-2 shows the notations used to specify the bases.


Table 34-2: Specifying Different Number Bases 


Notation       Description

number         By default, numbers without any notation are treated as decimal
               (base 10) integers.

0number        In arithmetic expressions, numbers with a leading zero are considered
               octal.

0xnumber       Hexadecimal notation.

base#number    number is in base.
Here are some examples:


[me@linuxbox ~]$ echo $((0xff))
255
[me@linuxbox ~]$ echo $((2#11111111))
255


In the previous examples, we print the value of the hexadecimal number 
ff (the largest two-digit number) and the largest eight-digit binary (base 2) 
number. 
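
The leading-zero rule has a practical consequence: zero-padded values such as 08 and 09 are rejected as invalid octal numbers. A sketch of one way to deal with this, using the base#number notation to force base 10 (the variable names are hypothetical):

```shell
#!/bin/bash
# Sketch: a leading zero means octal, which trips up zero-padded values
# such as 08 or 09. Forcing base 10 with base#number avoids the problem.
# The variable names are hypothetical.

month="08"

# $((month + 1)) would fail here because 8 is not a valid octal digit.
next=$(( 10#$month + 1 ))

echo "$next"
echo $(( 0x1f )) $(( 017 )) $(( 2#101 ))
```

The first line of output is 9. The second shows the same notations from the table evaluating to 31, 15, and 5.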


Unary Operators 


There are two unary operators, + and -, which are used to indicate whether 
a number is positive or negative, respectively. An example is -5. 


Simple Arithmetic 


Table 34-3 lists the ordinary arithmetic operators. 


Table 34-3: Arithmetic Operators 


Operator    Description

+           Addition

-           Subtraction

*           Multiplication

/           Integer division

**          Exponentiation

%           Modulo (remainder)


Most of these are self-explanatory, but integer division and modulo 
require further discussion. 

Since the shell’s arithmetic operates only on integers, the results of divi- 
sion are always whole numbers. 


[me@linuxbox ~]$ echo $(( 5 / 2 )) 
2 


This makes the determination of a remainder in a division operation 
more important. 


[me@linuxbox ~]$ echo $(( 5 % 2 )) 
1 


By using the division and modulo operators, we can determine that 5 
divided by 2 results in 2, with a remainder of 1. 
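
The same pairing of division and modulo splits any quantity into whole units and leftovers. As a sketch, the hypothetical function to_hms converts a number of seconds into hours, minutes, and seconds:

```shell
#!/bin/bash
# Sketch: integer division and modulo together split a quantity into
# whole units and leftovers. to_hms is a hypothetical function name.

to_hms () {
    local total="$1"
    local h=$(( total / 3600 ))
    local m=$(( (total % 3600) / 60 ))
    local s=$(( total % 60 ))
    printf "%02d:%02d:%02d\n" "$h" "$m" "$s"
}

to_hms 3725
```

Here 3725 seconds is one hour (3600) with 125 seconds left over, which in turn is 2 minutes and 5 seconds, so the function prints 01:02:05.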


Calculating the remainder is useful in loops. It allows an operation to be 
performed at specified intervals during the loop’s execution. In the follow- 
ing example, we display a line of numbers, highlighting each multiple of 5: 


#!/bin/bash
# modulo: demonstrate the modulo operator

for ((i = 0; i <= 20; i = i + 1)); do
    remainder=$((i % 5))
    if (( remainder == 0 )); then
        printf "<%d> " "$i"
    else
        printf "%d " "$i"
    fi
done
printf "\n"


When executed, the results look like this: 


[me@linuxbox ~]$ modulo 
<0> 1 2 3 4 <5> 6 7 8 9 <10> 11 12 13 14 <15> 16 17 18 19 <20>


Assignment 


Although its uses may not be immediately apparent, arithmetic expressions 
may perform assignment. We have performed assignment many times, 
though in a different context. Each time we give a variable a value, we are 
performing assignment. We can also do it within arithmetic expressions. 


[me@linuxbox ~]$ foo= 
[me@linuxbox ~]$ echo $foo 


[me@linuxbox ~]$ if (( foo = 5 )); then echo "It is true."; fi 
It is true. 

[me@linuxbox ~]$ echo $foo 

5 


In the preceding example, we first assign an empty value to the variable 
foo and verify that it is indeed empty. Next, we perform an if with the com- 
pound command (( foo = 5 )). This process does two interesting things: it 
assigns the value of 5 to the variable foo, and it evaluates to true because foo 
was assigned a non-zero value. 


It is important to remember the exact meaning of = in the previous expression. A 
single = performs assignment. foo = 5 says “make foo equal to 5,” while == evaluates 
equivalence. foo == 5 says “does foo equal 5?” This is a common feature in many 
programming languages. In the shell, this can be a little confusing because the test
command accepts a single = for string equivalence. This is yet another reason to use 
the more modern [[ ]] and (( )) compound commands in place of test. 
In addition to the = notation, the shell also provides notations that
perform some very useful assignments, as described in Table 34-4. 


Table 34-4: Assignment Operators 

Notation              Description

parameter = value     Simple assignment. Assigns value to parameter.

parameter += value    Addition. Equivalent to parameter = parameter + value.

parameter -= value    Subtraction. Equivalent to parameter = parameter - value.

parameter *= value    Multiplication. Equivalent to parameter = parameter * value.

parameter /= value    Integer division. Equivalent to parameter = parameter / value.

parameter %= value    Modulo. Equivalent to parameter = parameter % value.

parameter++           Variable post-increment. Equivalent to parameter = parameter + 1
                      (however, see the following discussion).

parameter--           Variable post-decrement. Equivalent to parameter = parameter - 1.

++parameter           Variable pre-increment. Equivalent to parameter = parameter + 1.

--parameter           Variable pre-decrement. Equivalent to parameter = parameter - 1.


These assignment operators provide a convenient shorthand for many 
common arithmetic tasks. Of special interest are the increment (++) and 
decrement (--) operators, which increase or decrease the value of their 
parameters by one. This style of notation is taken from the C program- 
ming language and has been incorporated into a number of other pro- 
gramming languages, including bash. 

The operators may appear either at the front of a parameter or at the 
end. While they both either increment or decrement the parameter by 
one, the two placements have a subtle difference. If placed at the front of 
the parameter, the parameter is incremented (or decremented) before the 
parameter is returned. If placed after, the operation is performed after 
the parameter is returned. This is rather strange, but it is the intended 
behavior. Here is a demonstration: 


[me@linuxbox ~]$ foo=1
[me@linuxbox ~]$ echo $((foo++))
1
[me@linuxbox ~]$ echo $foo
2


If we assign the value of one to the variable foo and then increment it 
with the ++ operator placed after the parameter name, foo is returned with 
the value of one. However, if we look at the value of the variable a second 
time, we see the incremented value. If we place the ++ operator in front of 
the parameter, we get the more expected behavior. 


[me@linuxbox ~]$ foo=1
[me@linuxbox ~]$ echo $((++foo))
2
[me@linuxbox ~]$ echo $foo
2

For most shell applications, prefixing the operator will be the most 
useful. 
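
The difference between the two placements is easiest to see when the result of the expression is captured. A small sketch, with hypothetical variable names:

```shell
#!/bin/bash
# Sketch: the same counter read through post- and pre-increment.
# The variable names are hypothetical.

n=5
post=$(( n++ ))   # post receives 5, then n becomes 6
pre=$(( ++n ))    # n becomes 7, then pre receives 7

echo "$post $pre $n"
```

This prints 5 7 7: the post-increment handed back the old value, while the pre-increment handed back the new one.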

The ++ and -- operators are often used in conjunction with loops. We 
will make some improvements to our modulo script to tighten it up a bit. 


#!/bin/bash
# modulo2: demonstrate the modulo operator

for ((i = 0; i <= 20; ++i )); do
    if (((i % 5) == 0 )); then
        printf "<%d> " "$i"
    else
        printf "%d " "$i"
    fi
done
printf "\n"


Bit Operations 


One class of operators manipulates numbers in an unusual way. These 
operators work at the bit level. They are used for certain kinds of low-level 
tasks, often involving setting or reading bit flags (see Table 34-5). 


Table 34-5: Bit Operators 


Operator    Description

~           Bitwise negation. Negate all the bits in a number.

<<          Left bitwise shift. Shift all the bits in a number to the left.

>>          Right bitwise shift. Shift all the bits in a number to the right.

&           Bitwise AND. Perform an AND operation on all the bits in two numbers.

|           Bitwise OR. Perform an OR operation on all the bits in two numbers.

^           Bitwise XOR. Perform an exclusive OR operation on all the bits in two
            numbers.


Note that there are also corresponding assignment operators (for 
example, <<=) for all but bitwise negation. 

Here we will demonstrate producing a list of powers of 2, using the left 
bitwise shift operator: 


[me@linuxbox ~]$ for ((i=0;i<8;++i)); do echo $((1<<i)); done 
1 
2 
4 
8 
16 
32
64 
128 
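
Since each power of 2 occupies its own bit, several on/off settings can be packed into one integer. The following is a sketch of setting, clearing, and testing such bit flags; the flag names and their meanings are hypothetical:

```shell
#!/bin/bash
# Sketch: store several on/off settings in one integer.
# The flag names and values are hypothetical.

FLAG_READ=$(( 1 << 0 ))    # 1
FLAG_WRITE=$(( 1 << 1 ))   # 2
FLAG_EXEC=$(( 1 << 2 ))    # 4

perms=0
(( perms |= FLAG_READ | FLAG_EXEC ))   # set two flags  -> 5
(( perms &= ~FLAG_EXEC ))              # clear one flag -> 1

if (( perms & FLAG_READ )); then
    echo "read is set"
fi
echo "$perms"
```

OR turns bits on, AND with a negated flag turns a bit off, and a plain AND tests whether a bit is on.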


Logic 


As we discovered in Chapter 27, the (( )) compound command supports a 
variety of comparison operators. There are a few more that can be used to 
evaluate logic. Table 34-6 provides the complete list. 


Table 34-6: Comparison Operators 


Operator             Description

<=                   Less than or equal to.

>=                   Greater than or equal to.

<                    Less than.

>                    Greater than.

==                   Equal to.

!=                   Not equal to.

&&                   Logical AND.

||                   Logical OR.

expr1?expr2:expr3    Comparison (ternary) operator. If expression expr1 evaluates to
                     be nonzero (arithmetic true), then expr2; else expr3.


When used for logical operations, expressions follow the rules of arith- 
metic logic; that is, expressions that evaluate as zero are considered false, 
while non-zero expressions are considered true. The (( )) compound com- 
mand maps the results into the shell’s normal exit codes. 


[me@linuxbox ~]$ if ((1)); then echo "true"; else echo "false"; fi 
true 
[me@linuxbox ~]$ if ((0)); then echo "true"; else echo "false"; fi 
false 


The strangest of the logical operators is the ternary operator. This 
operator (which is modeled after the one in the C programming language) 
performs a stand-alone logical test. It can be used as a kind of if/then/else 
statement. It acts on three arithmetic expressions (strings won’t work), and 
if the first expression is true (or non-zero), the second expression is per- 
formed. Otherwise, the third expression is performed. We can try this on 
the command line: 


[me@linuxbox ~]$ a=0
[me@linuxbox ~]$ ((a<1?++a:--a))
[me@linuxbox ~]$ echo $a
1
[me@linuxbox ~]$ ((a<1?++a:--a))
[me@linuxbox ~]$ echo $a
0

Here we see a ternary operator in action. This example implements 
a toggle. Each time the operator is performed, the value of the variable a 
switches from zero to one or vice versa. 

Please note that performing assignment within the expressions is not 
straightforward. When attempted, bash will declare an error. 


[me@linuxbox ~]$ a=0 
[me@linuxbox ~]$ ((a<1?a+=1:a-=1))
bash: ((: a<1?a+=1:a-=1: attempted assignment to non-variable (error token is "-=1")


This problem can be mitigated by surrounding the assignment expres- 
sion with parentheses. 


[me@linuxbox ~]$ ((a<1?(a+=1):(a-=1)))
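
The ternary operator is also a compact way to choose between two values. As a sketch, the hypothetical functions max and clamp below use it inside arithmetic expansion:

```shell
#!/bin/bash
# Sketch: the ternary operator as a compact if/then/else for numbers.
# max and clamp are hypothetical function names.

max () {
    echo $(( $1 > $2 ? $1 : $2 ))
}

clamp () {
    # Limit a value ($1) to the range low ($2) through high ($3).
    local v=$1 low=$2 high=$3
    echo $(( v < low ? low : (v > high ? high : v) ))
}

max 3 7
clamp 15 0 10
```

These calls print 7 and 10. The inner ternary in clamp is enclosed in parentheses to make the order of evaluation obvious.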


Next is a more complete example of using arithmetic operators in a 
script that produces a simple table of numbers: 


#!/bin/bash
# arith-loop: script to demonstrate arithmetic operators

finished=0
a=0
printf "a\ta**2\ta**3\n"
printf "=\t====\t====\n"

until ((finished)); do
    b=$((a**2))
    c=$((a**3))
    printf "%d\t%d\t%d\n" "$a" "$b" "$c"
    ((a<10?++a:(finished=1)))
done


In this script, we implement an until loop based on the value of the 
finished variable. Initially, the variable is set to zero (arithmetic false), 
and we continue the loop until it becomes non-zero. Within the loop, we 
calculate the square and cube of the counter variable a. At the end of the 
loop, the value of the counter variable is evaluated. If it is less than 10 (the 
maximum number of iterations), it is incremented by one, or else the vari- 
able finished is given the value of one, making finished arithmetically true, 
thereby terminating the loop. Running the script gives this result: 


[me@linuxbox ~]$ arith-loop 
a       a**2    a**3
=       ====    ====
0       0       0
1       1       1
2       4       8
3       9       27
4       16      64
5       25      125
6       36      216
7       49      343
8       64      512
9       81      729
10      100     1000


bc—An Arbitrary Precision Calculator Language 
We have seen how the shell can handle many types of integer arithmetic,
but what if we need to perform higher math or even just use floating-point 
numbers? The answer is, we can’t. At least not directly with the shell. To do 
this, we need to use an external program. There are several approaches we 
can take. Embedding Perl or AWK programs is one possible solution, but 
unfortunately, it’s outside the scope of this book. 

Another approach is to use a specialized calculator program. One such 
program found on many Linux systems is called bc. 

The bc program reads a file written in its own C-like language and exe- 
cutes it. A bc script may be a separate file, or it may be read from standard 
input. The bc language supports quite a few features including variables, 
loops, and programmer-defined functions. We won't cover bc entirely here, 
just enough to get a taste. bc is well documented by its man page. 

Let’s start with a simple example. We’ll write a bc script to add 2 plus 2. 


/* A very simple bc script */ 


2+ 2 


The first line of the script is a comment. bc uses the same syntax for 
comments as the C programming language. Comments, which may span 
multiple lines, begin with /* and end with */. 


Using bc


If we save the previous bc script as foo.bc, we can run it this way: 


[me@linuxbox ~]$ bc foo.bc 

bc 1.06.94 

Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc. 
This is free software with ABSOLUTELY NO WARRANTY. 

For details type `warranty'.

4 


If we look carefully, we can see the result at the very bottom, after the 
copyright message. This message can be suppressed with the -q (quiet) 
option. 


bc can also be used interactively. 


[me@linuxbox ~]$ bc -q
2+2 

4 

quit 


When using bc interactively, we simply type the calculations we want to 
perform, and the results are immediately displayed. The bc command quit 
ends the interactive session. 

It is also possible to pass a script to bc via standard input. 


[me@linuxbox ~]$ bc < foo.bc


The ability to take standard input means that we can use here docu- 
ments, here strings, and pipes to pass scripts. This is a here string example: 


[me@linuxbox ~]$ bc <<< "2+2"
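
Here strings make it easy to wrap bc in a small shell function. The following is only a sketch: calc is a hypothetical function name, and the bc program must be installed for it to work.

```shell
#!/bin/bash
# Sketch: wrap bc in a function to get fractional results in a script.
# calc is a hypothetical function name; bc must be installed.

calc () {
    # scale sets the number of digits kept after the decimal point
    # in the result of a division.
    bc <<< "scale=2; $1"
}

if command -v bc > /dev/null; then
    calc "10 / 3"
fi
```

With scale set to 2, the division produces 3.33, where the shell's own integer arithmetic would give 3.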


An Example Script 


As a real-world example, we will construct a script that performs a common 
calculation, monthly loan payments. In the script that follows, we use a here 
document to pass a script to bc:


#!/bin/bash
# loan-calc: script to calculate monthly loan payments

PROGNAME="${0##*/}" # Use parameter expansion to get basename

usage () {
cat <<- EOF
Usage: $PROGNAME PRINCIPAL INTEREST MONTHS

Where:

PRINCIPAL is the amount of the loan.
INTEREST is the APR as a number (7% = 0.07).
MONTHS is the length of the loan's term.

EOF
}

if (($# != 3)); then
    usage
    exit 1
fi

principal=$1
interest=$2
months=$3

bc <<- EOF
scale = 10
i = $interest / 12
p = $principal
n = $months
a = p * ((i * ((1 + i) ^ n)) / (((1 + i) ^ n) - 1))
print a, "\n"
EOF


When executed, the results look like this: 


[me@linuxbox ~]$ loan-calc 135000 0.0775 180 
1270.7222490000


This example calculates the monthly payment for a $135,000 loan at 
7.75 percent APR for 180 months (15 years). Notice the precision of the 
answer. This is determined by the value given to the special scale variable 
in the bc script. A full description of the bc scripting language is provided 
by the bc man page. While its mathematical notation is slightly different 
from that of the shell (bc more closely resembles C), most of it will be 
quite familiar, based on what we have learned so far. 


Summing Up 


In this chapter, we learned about many of the little things that can be used to 
get the “real work” done in scripts. As our experience with scripting grows, 
the ability to effectively manipulate strings and numbers will prove extremely 
valuable. Our loan-calc script demonstrates that even simple scripts can be 
created to do some really useful things. 


Extra Credit 
While the basic functionality of the loan-calc script is in place, the script is
far from complete. For extra credit, try improving the loan-calc script with 
the following features: 


• Full verification of the command line arguments

• A command line option to implement an “interactive” mode that will
prompt the user to input the principal, interest rate, and term of the loan

• A better format for the output


ARRAYS 


In the previous chapter, we looked at how the shell can manipulate
strings and numbers. The data types we have looked at so far are known
in computer science circles as scalar variables; that is, they are
variables that contain a single value.

In this chapter, we will look at another kind of data structure called an 
array, which holds multiple values. Arrays are a feature of virtually every 
programming language. The shell supports them, too, though in a rather 
limited fashion. Even so, they can be very useful for solving some types of 
programming problems. 


What Are Arrays? 


Arrays are variables that hold more than one value at a time. Arrays are
organized like a table. Let’s consider a spreadsheet as an example. A
spreadsheet acts like a two-dimensional array. It has both rows and columns,
and an individual cell in the spreadsheet can be located according to its row
and column address. An array behaves the same way. An array has cells, which
are called elements, and each element contains data. An individual array
element is accessed using an address called an index or subscript.

Most programming languages support multidimensional arrays. A spread- 
sheet is an example of a multidimensional array with two dimensions, width 
and height. Many languages support arrays with an arbitrary number of 
dimensions, though two- and three-dimensional arrays are probably the 
most commonly used. 

Arrays in bash are limited to a single dimension. We can think of them 
as a spreadsheet with a single column. Even with this limitation, there are 
many applications for them. Array support first appeared in bash version 2. 
The original Unix shell program, sh, did not support arrays at all. 


Creating an Array 


Array variables are named just like other bash variables and are created 
automatically when they are accessed. Here is an example: 


[me@linuxbox ~]$ a[1]=foo 
[me@linuxbox ~]$ echo ${a[1]} 
foo 


Here we see an example of both the assignment and access of an array 
element. With the first command, element 1 of array a is assigned the value
foo. The second command displays the stored value of element 1. The use 
of braces in the second command is required to prevent the shell from 
attempting pathname expansion on the name of the array element. 

An array can also be created with the declare command. 


[me@linuxbox ~]$ declare -a a 


Using the -a option, this example of declare creates the array a. 


Assigning Values to an Array 
Values may be assigned in one of two ways. Single values may be assigned
using the following syntax: 


name[subscript]=value


where name is the name of the array and subscript is an integer (or arithmetic 
expression) greater than or equal to zero. Note that an array's first element 
is subscript zero, not one. value is a string or integer assigned to the array 
element. 

Multiple values may be assigned using the following syntax: 


name=(value1 value2 ...) 


where name is the name of the array and the value placeholders are values 
assigned sequentially to elements of the array, starting with element zero. 
For example, if we wanted to assign abbreviated days of the week to the 
array days, we could do this: 


[me@linuxbox ~]$ days=(Sun Mon Tue Wed Thu Fri Sat) 


It is also possible to assign values to a specific element by specifying a 
subscript for each value. 


[me@linuxbox ~]$ days=([0]=Sun [1]=Mon [2]=Tue [3]=Wed [4]=Thu [5]=Fri [6]=Sat) 
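Because a subscript may be an arithmetic expression, it can be computed at the point of use. The following is a minimal sketch under that assumption; the variable names today, tomorrow, and after_sat are made up for illustration.

```bash
#!/bin/bash
# Sketch: subscripts of indexed arrays are evaluated as arithmetic.

days=(Sun Mon Tue Wed Thu Fri Sat)

today=2                                 # hypothetical value: subscript of Tue
tomorrow="${days[(today + 1) % 7]}"     # evaluates to subscript 3
after_sat="${days[(6 + 1) % 7]}"        # wraps around to subscript 0

echo "Tomorrow: $tomorrow"
echo "Day after Sat: $after_sat"
```

The modulo operation keeps the computed subscript within the seven assigned elements.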


Accessing Array Elements 


So, what are arrays good for? Just as many data-management tasks can be 
performed with a spreadsheet program, many programming tasks can 
be performed with arrays. 

Let’s consider a simple data-gathering and presentation example. We 
will construct a script that examines the modification times of the files in 
a specified directory. From this data, our script will output a table showing 
at what hour of the day the files were last modified. Such a script could be 
used to determine when a system is most active. This script, called hours, 
produces this result: 


[me@linuxbox ~]$ hours .
Hour    Files   Hour    Files
----    -----   ----    -----
00      0       12      11
01      1       13      7
02      0       14      1
03      0       15      7
04      1       16      6
05      1       17      5
06      6       18      4
07      3       19      4
08      1       20      1
09      14      21      0
10      2       22      0
11      5       23      0

Total files = 80


We execute the hours program, specifying the current directory as the 
target. It produces a table showing, for each hour of the day (0-23), how 
many files were last modified. The code to produce this is as follows: 


#!/bin/bash 


# hours: script to count files by modification time

usage () {
    echo "usage: ${0##*/} directory" >&2
}

# Check that argument is a directory
if [[ ! -d "$1" ]]; then
    usage
    exit 1
fi

# Initialize array
for i in {0..23}; do hours[i]=0; done

# Collect data
for i in $(stat -c %y "$1"/* | cut -c 12-13); do
    j="${i#0}"
    ((++hours[j]))
    ((++count))
done

# Display data
echo -e "Hour\tFiles\tHour\tFiles"
echo -e "----\t-----\t----\t-----"
for i in {0..11}; do
    j=$((i + 12))
    printf "%02d\t%d\t%02d\t%d\n" \
        "$i" \
        "${hours[i]}" \
        "$j" \
        "${hours[j]}"
done
printf "\nTotal files = %d\n" $count


The script consists of one function (usage) and a main body with four 
sections. In the first section, we check that there is a command line argument 
and that it is a directory. If it is not, we display the usage message and exit. 

The second section initializes the array hours. It does this by assigning 
each element a value of zero. There is no special requirement to prepare 
arrays prior to use, but our script needs to ensure that no element is empty. 
Note the interesting way the loop is constructed. By employing brace expan- 
sion ({0..23}), we are able to easily generate a sequence of words for the for 
command. 

The next section gathers the data by running the stat program on 
each file in the directory. We use cut to extract the two-digit hour from the 
result. Inside the loop, we need to remove leading zeros from the hour field 
since the shell will try (and ultimately fail) to interpret values 00 through 
09 as octal numbers (see Table 34-2). Next, we increment the value of the 
array element corresponding with the hour of the day. Finally, we incre- 
ment a counter (count) to track the total number of files in the directory. 

The last section of the script displays the contents of the array. We first 
output a couple of header lines and then enter a loop that produces four 
columns of output. Lastly, we output the final tally of files. 
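The leading-zero problem described above can be seen in isolation. This sketch shows the parameter expansion used by the hours script next to an alternative, the base#number notation of arithmetic expansion. The variable names are illustrative only.

```bash
#!/bin/bash
# Sketch: two ways to keep a zero-padded hour from being read as octal.

hour=09                   # $((hour + 1)) would fail here: 09 is not valid octal

stripped="${hour#0}"      # remove one leading zero, as the hours script does
forced=$((10#$hour))      # force base 10 with the base#number notation

echo "$stripped $forced"

midnight=00
echo "${midnight#0} $((10#$midnight))"   # both forms still yield 0
```

Either form gives a value that is safe to use as an array subscript. The second form has the advantage of also handling values with more than one leading zero.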


Array Operations 


There are many common array operations. Such things as deleting arrays, 
determining their size, sorting, and so on, have many applications in 
scripting. 


Outputting the Entire Contents of an Array 


The subscripts * and @ can be used to access every element in an array. As 
with positional parameters, the @ notation is the more useful of the two. 
Here is a demonstration: 


[me@linuxbox ~]$ animals=("a dog" "a cat" "a fish")
[me@linuxbox ~]$ for i in ${animals[*]}; do echo $i; done
a
dog
a
cat
a
fish
[me@linuxbox ~]$ for i in ${animals[@]}; do echo $i; done
a
dog
a
cat
a
fish
[me@linuxbox ~]$ for i in "${animals[*]}"; do echo $i; done
a dog a cat a fish
[me@linuxbox ~]$ for i in "${animals[@]}"; do echo $i; done
a dog
a cat
a fish


We create the array animals and assign it three two-word strings. We 
then execute four loops to see the effect of word splitting on the array con- 
tents. The behavior of notations ${animals[*]} and ${animals[@]} is identical 
until they are quoted. The * notation results in a single word containing 
the array’s contents, while the @ notation results in three two-word strings, 
which matches the array’s “real” contents. 
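One way to confirm the word counts described above is to pass each expansion to a function and let it report how many arguments it received. This is only a sketch; count_args is a hypothetical helper, not a bash builtin.

```bash
#!/bin/bash
# Sketch: count the words produced by each form of expansion.

count_args () {
    echo "$#"          # number of arguments the function received
}

animals=("a dog" "a cat" "a fish")

unquoted=$(count_args ${animals[@]})        # word splitting yields six words
quoted_star=$(count_args "${animals[*]}")   # one word holding everything
quoted_at=$(count_args "${animals[@]}")     # one word per array element

echo "unquoted: $unquoted, quoted *: $quoted_star, quoted @: $quoted_at"
```

Only the quoted @ form reports three arguments, which is why it is the usual choice when elements may contain spaces.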


Determining the Number of Array Elements 


Using parameter expansion, we can determine the number of elements in 
an array in much the same way as finding the length of a string. Here is an 
example: 


[me@linuxbox ~]$ a[100]=foo
[me@linuxbox ~]$ echo ${#a[@]} # number of array elements
1
[me@linuxbox ~]$ echo ${#a[100]} # length of element 100
3


We create array a and assign the string foo to element 100. Next, we use
parameter expansion to examine the length of the array, using the @ nota- 
tion. Finally, we look at the length of element 100, which contains the string 
foo. It is interesting to note that while we assigned our string to element 100, 
bash reports only one element in the array. This differs from the behavior of 
some other languages in which the unused elements of the array (elements 
0-99) would be initialized with empty values and counted. In bash, array 
elements exist only if they have been assigned a value regardless of their 
subscript. 


Finding the Subscripts Used by an Array 


As bash allows arrays to contain “gaps” in the assignment of subscripts, it is 
sometimes useful to determine which elements actually exist. This can be 
done with a parameter expansion using the following forms: 


${!array[*]} 
${!array[@]} 


where array is the name of an array variable. Like the other expansions that 
use * and @, the @ form enclosed in quotes is the most useful, as it expands 
into separate words. 


[me@linuxbox ~]$ foo=([2]=a [4]=b [6]=c)
[me@linuxbox ~]$ for i in "${foo[@]}"; do echo $i; done
a
b
c
[me@linuxbox ~]$ for i in "${!foo[@]}"; do echo $i; done
2
4
6
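The subscript list is often used to walk an array with gaps while keeping each value paired with its subscript. The sketch below also shows one possible way to find the highest subscript in use; the names report, indexes, and last_index are chosen for illustration.

```bash
#!/bin/bash
# Sketch: pair each subscript with its value in an array that has gaps.

foo=([2]=a [4]=b [6]=c)

report=""
for i in "${!foo[@]}"; do
    report="$report$i=${foo[i]} "
done
echo "$report"

# Copy the subscripts into a second (gap-free) array and take its last entry
indexes=("${!foo[@]}")
last_index=${indexes[${#indexes[@]} - 1]}
echo "Highest subscript in use: $last_index"
```

Since the subscripts of an indexed array are produced in ascending order, the last entry of the copy is the highest subscript.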


Adding Elements to the End of an Array 


Knowing the number of elements in an array is no help if we need to append 
values to the end of an array since the values returned by the * and @ nota- 
tions do not tell us the maximum array index in use. Fortunately, the shell 
provides us with a solution. By using the += assignment operator, we can auto- 
matically append values to the end of an array. Here, we assign three values 
to the array foo and then append three more: 


[me@linuxbox ~]$ foo=(a b c)
[me@linuxbox ~]$ echo ${foo[@]}
a b c
[me@linuxbox ~]$ foo+=(d e f)
[me@linuxbox ~]$ echo ${foo[@]}
a b c d e f
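Two details of the += operator are easy to miss, and the following sketch illustrates them under the assumption of a reasonably recent bash: appended values are placed after the highest subscript in use, and omitting the parentheses appends text to element zero instead of adding an element. The array names are arbitrary.

```bash
#!/bin/bash
# Sketch: how += behaves with gaps and without parentheses.

foo=([2]=a [4]=b [6]=c)
foo+=(d e)                    # new elements follow the highest subscript
subscripts="${!foo[*]}"
echo "Subscripts: $subscripts"

bar=(a b c)
bar+=d                        # no parentheses: string append to element 0
first="${bar[0]}"
size=${#bar[@]}
echo "Element 0: $first, elements: $size"
```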


Sorting an Array 


Just as with spreadsheets, it is often necessary to sort the values in a column 
of data. The shell has no direct way of doing this, but it’s not hard to do 
with a little coding. 


#!/bin/bash

# array-sort: Sort an array

a=(f e d c b a)

echo "Original array: ${a[@]}"
a_sorted=($(for i in "${a[@]}"; do echo $i; done | sort))
echo "Sorted array: ${a_sorted[@]}"


When executed, the script produces this: 


[me@linuxbox ~]$ array-sort
Original array: f e d c b a
Sorted array: a b c d e f


The script operates by copying the contents of the original array (a) 
into a second array (a_sorted) with a tricky piece of command substitution. 
This basic technique can be used to perform many kinds of operations on 
the array by changing the design of the pipeline. 
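The command substitution in array-sort relies on word splitting, so an element containing a space would be broken into separate elements. A possible variation, sketched here, reads the sorted lines with the readarray builtin (bash 4.0 or later) and a process substitution instead. It assumes that no element contains a newline, and the sample data is invented.

```bash
#!/bin/bash
# Sketch: sort an array whose elements may contain spaces.

a=("f item" "e item" d c b a)

# -t strips the trailing newline from each line that is read
readarray -t a_sorted < <(printf '%s\n' "${a[@]}" | LC_ALL=C sort)

printf '<%s>\n' "${a_sorted[@]}"
```

Setting LC_ALL=C for sort is a choice made here to get the same ordering on every system; it can be dropped when locale-aware sorting is wanted.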


Deleting an Array 


To delete an array, use the unset command. 


[me@linuxbox ~]$ foo=(a b c d e f)
[me@linuxbox ~]$ echo ${foo[@]}
a b c d e f
[me@linuxbox ~]$ unset foo
[me@linuxbox ~]$ echo ${foo[@]}

[me@linuxbox ~]$


unset may also be used to delete single array elements. 


[me@linuxbox ~]$ foo=(a b c d e f)
[me@linuxbox ~]$ echo ${foo[@]}
a b c d e f
[me@linuxbox ~]$ unset 'foo[2]'
[me@linuxbox ~]$ echo ${foo[@]}
a b d e f


In this example, we delete the third element of the array, subscript 2.
Remember, arrays start with subscript zero, not one! Notice also that the
array element must be quoted to prevent the shell from performing
pathname expansion.

Interestingly, the assignment of an empty value to an array does not
empty its contents.


[me@linuxbox ~]$ foo=(a b c d e f)
[me@linuxbox ~]$ foo=
[me@linuxbox ~]$ echo ${foo[@]}
b c d e f


Any reference to an array variable without a subscript refers to element 
zero of the array. 


[me@linuxbox ~]$ foo=(a b c d e f)
[me@linuxbox ~]$ echo ${foo[@]}
a b c d e f
[me@linuxbox ~]$ foo=A
[me@linuxbox ~]$ echo ${foo[@]}
A b c d e f
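The deletion behaviors above can be checked by looking at the subscripts and element counts rather than the values. This sketch also shows two idioms that are assumptions of this example rather than part of the text: reassigning the array to itself to close a gap, and assigning an empty list to remove every element while keeping the variable.

```bash
#!/bin/bash
# Sketch: what unset and empty assignments do to subscripts and counts.

foo=(a b c d e f)

unset 'foo[2]'
indexes_after_unset="${!foo[*]}"    # subscript 2 is now missing

foo=("${foo[@]}")                   # reassign to renumber the elements
indexes_repacked="${!foo[*]}"

foo=                                # empties element 0 only
count_after_empty=${#foo[@]}

foo=()                              # removes all elements
count_after_reset=${#foo[@]}

echo "$indexes_after_unset | $indexes_repacked"
echo "$count_after_empty | $count_after_reset"
```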


Associative Arrays 


bash versions 4.0 and greater support associative arrays. Associative arrays use 
strings rather than integers as array indexes. This capability allows interest- 
ing new approaches to managing data. For example, we can create an array 
called colors and use color names as indexes. 


declare -A colors
colors["red"]="#ff0000"
colors["green"]="#00ff00"
colors["blue"]="#0000ff"


Unlike integer indexed arrays, which are created by merely referencing 
them, associative arrays must be created with the declare command using 
the new -A option. Associative array elements are accessed in much the 
same way as integer-indexed arrays. 


echo ${colors["blue"]}


In the next chapter, we will look at a script that makes good use of asso- 
ciative arrays to produce an interesting report. 
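As a small preview, the colors array can be listed and queried as shown in this sketch. It assumes bash 4.0 or later. The keys of an associative array are not returned in any particular order, so the sketch sorts them, and the purple key is a made-up example of a missing entry.

```bash
#!/bin/bash
# Sketch: list the keys of an associative array and test for a missing key.

declare -A colors
colors["red"]="#ff0000"
colors["green"]="#00ff00"
colors["blue"]="#0000ff"

# Sort the keys to get a stable listing
sorted_keys=$(printf '%s\n' "${!colors[@]}" | LC_ALL=C sort)
for name in $sorted_keys; do
    echo "$name = ${colors[$name]}"
done

# ${parameter+word} expands to word only if the element is set
if [ -n "${colors[purple]+set}" ]; then
    has_purple=yes
else
    has_purple=no
fi
echo "purple defined: $has_purple"
```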


Summing Up


If we search the bash man page for the word array, we find many instances
of where bash makes use of array variables. Most of these are rather obscure,
but they may provide occasional utility in some special circumstances. In
fact, the entire topic of arrays is rather underutilized in shell programming,
owing largely to the fact that the traditional Unix shell programs (such as
sh) lacked any support for arrays. This lack of popularity is unfortunate
because arrays are widely used in other programming languages and provide
a powerful tool for solving many kinds of programming problems.

Arrays and loops have a natural affinity and are often used together. 
The following form of loop is particularly well-suited to calculating array 
subscripts: 


for ((expr; expr; expr)) 
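A minimal sketch of that loop form applied to an array follows. It assumes an array without gaps, so every subscript from zero up to the element count minus one is in use; the listing variable is only there to collect the result.

```bash
#!/bin/bash
# Sketch: step through an array by calculating each subscript.

days=(Sun Mon Tue Wed Thu Fri Sat)

listing=""
for ((i = 0; i < ${#days[@]}; i++)); do
    listing="$listing$i:${days[i]} "
done

echo "$listing"
```

For arrays that may contain gaps, looping over "${!days[@]}" is the safer choice because it visits only the subscripts that exist.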
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In this, the final chapter of our journey, we 
will look at some odds and ends. While we 
have certainly covered a lot of ground in the 
previous chapters, there are many bash features
that we have not covered. Most are fairly obscure and
useful mainly to those integrating bash into a Linux distribution. However, 
there are a few that, while not in common use, are helpful for certain pro- 
gramming problems. We will cover them here. 


Group Commands and Subshells 


bash allows commands to be grouped together. This can be done in one of 
two ways, either with a group command or with a subshell. 
Here is the syntax of a group command: 


{ command1; command2; [command3; ...] }


Here is the syntax of a subshell:


(command1; command2; [command3;...]) 


The two forms differ in that a group command surrounds its commands 
with braces and a subshell uses parentheses. It is important to note that 
because of the way bash implements group commands, the braces must be 
separated from the commands by a space and the last command must be 
terminated with either a semicolon or a newline prior to the closing brace. 

So, what are group commands and subshells good for? While they have 
an important difference (which we will get to in a moment), they are both 
used to manage redirection. Let’s consider a script segment that performs 
redirections on multiple commands. 


ls -l > output.txt
echo "Listing of foo.txt" >> output.txt 
cat foo.txt >> output.txt 


This is pretty straightforward. Three commands have their output redi- 
rected to a file named output.txt. Using a group command, we could code 
this as follows: 


{ ls -l; echo "Listing of foo.txt"; cat foo.txt; } > output.txt


Using a subshell is similar. 


(ls -l; echo "Listing of foo.txt"; cat foo.txt) > output.txt


Using this technique we have saved ourselves some typing, but where a 
group command or subshell really shines is with pipelines. When construct- 
ing a pipeline of commands, it is often useful to combine the results of sev- 
eral commands into a single stream. Group commands and subshells make 
this easy. 


{ ls -l; echo "Listing of foo.txt"; cat foo.txt; } | lpr


Here we have combined the output of our three commands and piped 
them into the input of lpr to produce a printed report. 

In the script that follows, we will use group commands and look at
several programming techniques that can be employed in conjunction 
with associative arrays. This script, called array-2, when given the name 
of a directory, prints a listing of the files in the directory along with the 
names of the file’s owner and group owner. At the end of the listing, the 
script prints a tally of the number of files belonging to each owner and 
group. Here we see the results (condensed for brevity) when the script is 
given the directory /usr/bin: 


[me@linuxbox ~]$ array-2 /usr/bin
/usr/bin/2to3-2.6                        root       root
/usr/bin/2to3                            root       root
/usr/bin/a2p                             root       root
/usr/bin/abrowser                        root       root
/usr/bin/aconnect                        root       root
/usr/bin/acpi_fakekey                    root       root
/usr/bin/acpi_listen                     root       root
/usr/bin/add-apt-repository              root       root
--snip--
/usr/bin/zipgrep                         root       root
/usr/bin/zipinfo                         root       root
/usr/bin/zipnote                         root       root
/usr/bin/zip                             root       root
/usr/bin/zipsplit                        root       root
/usr/bin/zjsdecode                       root       root
/usr/bin/zsoelim                         root       root

File owners:
daemon    :     1 file(s)
root      :  1394 file(s)

File group owners:
crontab   :     1 file(s)
daemon    :     1 file(s)
lpadmin   :     1 file(s)
mail      :     4 file(s)
mlocate   :     1 file(s)
root      :  1380 file(s)
shadow    :     2 file(s)
ssh       :     1 file(s)
tty       :     2 file(s)
utmp      :     2 file(s)


Here is a listing (with line numbers) of the script:


 1  #!/bin/bash
 2
 3  # array-2: Use arrays to tally file owners
 4
 5  declare -A files file_group file_owner groups owners
 6
 7  if [[ ! -d "$1" ]]; then
 8      echo "Usage: array-2 dir" >&2
 9      exit 1
10  fi
11
12  for i in "$1"/*; do
13      owner="$(stat -c %U "$i")"
14      group="$(stat -c %G "$i")"
15      files["$i"]="$i"
16      file_owner["$i"]="$owner"
17      file_group["$i"]="$group"
18      ((++owners[$owner]))
19      ((++groups[$group]))
20  done
21
22  # List the collected files
23  { for i in "${files[@]}"; do
24      printf "%-40s %-10s %-10s\n" \
25          "$i" "${file_owner["$i"]}" "${file_group["$i"]}"
26  done } | sort
27  echo
28
29  # List owners
30  echo "File owners:"
31  { for i in "${!owners[@]}"; do
32      printf "%-10s: %5d file(s)\n" "$i" "${owners["$i"]}"
33  done } | sort
34  echo
35
36  # List groups
37  echo "File group owners:"
38  { for i in "${!groups[@]}"; do
39      printf "%-10s: %5d file(s)\n" "$i" "${groups["$i"]}"
40  done } | sort

Let’s take a look at the mechanics of this script. 


Line 5: Associative arrays must be created with the declare command 
using the -A option. In this script, we create five arrays as follows: 


• files contains the names of the files in the directory, indexed by
  filename.
• file_group contains the group owner of each file, indexed by
  filename.
• file_owner contains the owner of each file, indexed by filename.
• groups contains the number of files belonging to the indexed group.
• owners contains the number of files belonging to the indexed owner.


Lines 7-10: These lines check to see that a valid directory name was 
passed as a positional parameter. If not, a usage message is displayed, 
and the script exits with an exit status of 1. 


Lines 12-20: These lines loop through the files in the directory. Using
the stat command, lines 13 and 14 extract the names of the file owner
and group owner and assign the values to their respective arrays (lines
16 and 17) using the name of the file as the array index. Likewise, the
filename itself is assigned to the files array (line 15).


Lines 18-19: The total number of files belonging to the file owner and 
group owner are incremented by one. 


Lines 22-27: The list of files is output. This is done using the "${array[@]}"
parameter expansion, which expands into the entire list of array elements
with each element treated as a separate word. This allows for
the possibility that a filename may contain embedded spaces. Also
note that the entire loop is enclosed in braces, thus forming a group
command. This permits the entire output of the loop to be piped into
the sort command. This is necessary because the expansion of the array
elements is not sorted.


Lines 29-40: These two loops are similar to the file list loop except that 
they use the "${!array[@]}" expansion, which expands into the list of array 
indexes rather than the list of array elements. 
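The tallying technique used on lines 18 and 19 can be tried without touching the filesystem. In this sketch the list of owner names is invented sample data standing in for the output of stat; the rest follows the pattern of the script.

```bash
#!/bin/bash
# Sketch: tally occurrences with an associative array (sample data only).

declare -A owners

for owner in root root daemon root; do
    ((++owners[$owner]))      # an unset element counts as 0 before incrementing
done

# The keys are unordered, so the output is piped through sort
summary=$(for i in "${!owners[@]}"; do
    printf "%-10s: %5d file(s)\n" "$i" "${owners[$i]}"
done | sort)

echo "$summary"
```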


Process Substitution 


While they look similar and can both be used to combine streams for redi- 
rection, there is an important difference between group commands and 
subshells. Whereas a group command executes all of its commands in the 
current shell, a subshell (as the name suggests) executes its commands in a 
child copy of the current shell. This means the environment is copied and 
given to a new instance of the shell. When the subshell exits, the copy of 
the environment is lost, so any changes made to the subshell’s environment 
(including variable assignment) are lost as well. Therefore, in most cases, 
unless a script requires a subshell, group commands are preferable to sub- 
shells. Group commands are both faster and require less memory. 
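The difference is easy to observe with a single variable. In the following sketch (the variable names are arbitrary) an assignment made inside a group command is still visible afterward, while one made inside a subshell is not.

```bash
#!/bin/bash
# Sketch: variable assignments in a group command versus a subshell.

color=red

{ color=green; }           # group command: runs in the current shell
after_group=$color

( color=blue )             # subshell: runs in a child copy of the shell
after_subshell=$color      # the assignment to blue was lost

echo "After group command: $after_group"
echo "After subshell:      $after_subshell"
```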

We saw an example of the subshell environment problem in Chapter 28, 
when we discovered that a read command in a pipeline does not work as we 
might intuitively expect. To recap, if we construct a pipeline like this: 


echo "foo" | read 
echo $REPLY 


the content of the REPLY variable is always empty because the read command 
is executed in a subshell, and its copy of REPLY is destroyed when the subshell 
terminates. 

Because commands in pipelines are always executed in subshells, any 
command that assigns variables will encounter this issue. Fortunately, the 
shell provides an exotic form of expansion called process substitution that can 
be used to work around this problem. 

Process substitution is expressed in two ways. 

For processes that produce standard output, it looks like this: 


<(list) 


For processes that intake standard input, it looks like this: 


>(list) 


where list is a list of commands. 


To solve our problem with read, we can employ process substitution 
like this. 


read < <(echo "foo") 
echo $REPLY


Process substitution allows us to treat the output of a subshell as an
ordinary file for purposes of redirection. In fact, since it is a form of
expansion, we can examine its real value.


[me@linuxbox ~]$ echo <(echo "foo") 
/dev/fd/63 


By using echo to view the result of the expansion, we see that the output
of the subshell is being provided by a file named /dev/fd/63.
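Because the expansion is just a filename, it can be handed to any program that expects file arguments. As a sketch of that idea, the following compares two lists while ignoring their order by giving diff two process substitutions. The fruit names are sample data, and the example assumes a system that provides /dev/fd entries, as Linux does.

```bash
#!/bin/bash
# Sketch: pass two process substitutions to a program that takes filenames.

if diff <(printf '%s\n' pear apple fig | sort) \
        <(printf '%s\n' fig pear apple | sort) > /dev/null; then
    same=yes
else
    same=no
fi

echo "Same items: $same"
```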

Process substitution is often used with loops containing read. Here is an 
example of a read loop that processes the contents of a directory listing cre- 
ated by a subshell: 


#!/bin/bash

# pro-sub: demo of process substitution

while read attr links owner group size date time filename; do
    cat << EOF
Filename:   $filename
Size:       $size
Owner:      $owner
Group:      $group
Modified:   $date $time
Links:      $links
Attributes: $attr

EOF
done < <(ls -l --time-style="+%F %H:%M" | tail -n +2)


The loop executes read for each line of a directory listing. The listing 
itself is produced on the final line of the script. This line redirects the out- 
put of the process substitution into the standard input of the loop. The tail 
command is included in the process substitution pipeline to eliminate the 
first line of the listing, which is not needed. 
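The practical benefit shows up when the loop needs to change a variable that is used later. The sketch below counts lines in two ways, assuming bash's default settings: once with a pipeline, where the loop runs in a subshell, and once with process substitution, where it runs in the current shell. The counter names are illustrative.

```bash
#!/bin/bash
# Sketch: a counter survives the loop only when the loop is not in a pipeline.

count_piped=0
printf '%s\n' one two three | while read -r line; do
    ((++count_piped))          # incremented in a subshell, then discarded
done

count_redirected=0
while read -r line; do
    ((++count_redirected))     # incremented in the current shell
done < <(printf '%s\n' one two three)

echo "Pipeline: $count_piped"
echo "Process substitution: $count_redirected"
```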

When executed, the script produces output like this: 


[me@linuxbox ~]$ pro-sub | head -n 20
Filename:   addresses.ldif
Size:       14540
Owner:      me
Group:      me
Modified:   2009-04-02 11:12
Links:      1
Attributes: -rw-r--r--

Filename:   bin
Size:       4096
Owner:      me
Group:      me
Modified:   2009-07-10 07:31
Links:      2
Attributes: drwxr-xr-x

Filename:   bookmarks.html
Size:       394213
Owner:      me
Group:      me


Traps

In Chapter 10, we saw how programs can respond to signals. We can add 
this capability to our scripts, too. While the scripts we have written so far 
have not needed this capability (because they have very short execution 
times and do not create temporary files), larger and more complicated 
scripts may benefit from having a signal handling routine. 

When we design a large, complicated script, it is important to consider 
what happens if the user logs off or shuts down the computer while the script 
is running. When such an event occurs, a signal will be sent to all affected 
processes. In turn, the programs representing those processes can perform 
actions to ensure a proper and orderly termination of the program. Let’s say, 
for example, that we wrote a script that created a temporary file during its 
execution. In the course of good design, we would have the script delete the 
file when the script finishes its work. It would also be smart to have the script 
delete the file if a signal is received indicating that the program was going to 
be terminated prematurely. 

bash provides a mechanism for this purpose known as a trap. Traps are 
implemented with the appropriately named builtin command, trap. trap 
uses the following syntax: 


trap argument signal [signal...] 


where argument is a string that will be read and treated as a command and 
signal is the specification of a signal that will trigger the execution of the 
interpreted command. 

Here is a simple example: 


#!/bin/bash

# trap-demo: simple signal handling demo

trap "echo 'I am ignoring you.'" SIGINT SIGTERM

for i in {1..5}; do
    echo "Iteration $i of 5"
    sleep 5
done


This script defines a trap that will execute an echo command each time
either the SIGINT or SIGTERM signal is received while the script is running.
Execution of the program looks like this when the user attempts to stop the
script by pressing CTRL-C:


[me@linuxbox ~]$ trap-demo
Iteration 1 of 5
Iteration 2 of 5
^CI am ignoring you.
Iteration 3 of 5
^CI am ignoring you.
Iteration 4 of 5
Iteration 5 of 5


As we can see, each time the user attempts to interrupt the program, 
the message is printed instead. 

Constructing a string to form a useful sequence of commands can be 
awkward, so it is common practice to specify a shell function as the com- 
mand. In this example, a separate shell function is specified for each signal 
to be handled: 


#!/bin/bash

# trap-demo2: simple signal handling demo

exit_on_signal_SIGINT () {
    echo "Script interrupted." >&2
    exit 0
}

exit_on_signal_SIGTERM () {
    echo "Script terminated." >&2
    exit 0
}

trap exit_on_signal_SIGINT SIGINT
trap exit_on_signal_SIGTERM SIGTERM

for i in {1..5}; do
    echo "Iteration $i of 5"
    sleep 5
done


This script features two trap commands, one for each signal. Each trap, 
in turn, specifies a shell function to be executed when the particular signal 
is received. Note the inclusion of an exit command in each of the signal- 
handling functions. Without an exit, the script would continue after com- 
pleting the function. 

When the user presses CTRL-C during the execution of this script, the 
results look like this: 


[me@linuxbox ~]$ trap-demo2
Iteration 1 of 5
Iteration 2 of 5
^CScript interrupted.
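A common use of this pattern is removing a temporary file no matter how the script ends. The following is a sketch rather than a complete recipe. It assumes the mktemp program is installed, uses bash's EXIT pseudo-signal (which fires whenever the shell exits), and the names tempfile and cleanup are chosen for illustration.

```bash
#!/bin/bash
# Sketch: remove a temporary file however the script ends.

tempfile=$(mktemp) || exit 1     # mktemp creates the file and prints its name

cleanup () {
    rm -f "$tempfile"
}

trap cleanup EXIT                # runs whenever the script exits
trap 'exit 1' SIGINT SIGTERM     # turn these signals into an exit, which
                                 # in turn triggers the EXIT trap

echo "intermediate results" > "$tempfile"
echo "Working with $tempfile"
```

Routing the signals through exit keeps the cleanup code in one place instead of repeating it in every handler.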


TEMPORARY FILES 


One reason signal handlers are included in scripts is to remove temporary files 
that the script may create to hold intermediate results during execution. There is 
something of an art to naming temporary files. Traditionally, programs on Unix- 
like systems create their temporary files in the /tmp directory, a shared directory 
intended for such files. However, since the directory is shared, this poses certain 
security concerns, particularly for programs running with superuser privileges. 
Aside from the obvious step of setting proper permissions for files exposed to
all users of the system, it is important to give temporary files nonpredictable
filenames. This avoids an exploit known as a temp race attack. One way to create
a nonpredictable (but still descriptive) name is to do something like this:


tempfile=/tmp/$(basename $0).$$.$RANDOM


This will create a filename consisting of the program’s name, followed by 
its process ID (PID), followed by a random integer. Note, however, that the 
$RANDOM shell variable returns a value only in the range of 0-32767, which is
not a large range in computer terms, so a single instance of the variable is not 
sufficient to overcome a determined attacker. 

A better way is to use the mktemp program (not to be confused with the 
mktemp standard library function) to both name and create the temporary file. 
The mktemp program accepts a template as an argument that is used to build 
the filename. The template should include a series of X characters, which are 
replaced by a corresponding number of random letters and numbers. The 
longer the series of X characters, the longer the series of random characters. 
Here is an example: 


tempfile=$(mktemp /tmp/foobar.$$.XXXXXXXXXX)


This creates a temporary file and assigns its name to the variable tempfile. 
The X characters in the template are replaced with random letters and numbers 
so that the final filename (which, in this example, also includes the expanded 
value of the special parameter $$ to obtain the PID) might be something like this: 


/tmp/foobar.6593.U0ZuvM6654


For scripts that are executed by regular users, it may be wise to avoid the
use of the /tmp directory and create a directory for temporary files within the
user’s home directory, with a line of code such as this:


[[ -d $HOME/tmp ]] || mkdir $HOME/tmp


It is sometimes desirable to perform more than one task at the same time.
We have seen how all modern operating systems are at least multitasking if 
not multiuser as well. Scripts can be constructed to behave in a multitasking 
fashion. 

Usually this involves launching a script that, in turn, launches one or 
more child scripts to perform an additional task while the parent script 
continues to run. However, when a series of scripts runs this way, there can 
be problems keeping the parent and child coordinated. That is, what if the 
parent or child is dependent on the other and one script must wait for the 
other to finish its task before finishing its own? 

bash has a builtin command to help manage asynchronous execution such 
as this. The wait command causes a parent script to pause until a specified 
process (i.e., the child script) finishes. To demonstrate this, we will need 
two scripts. The first is a parent script. 


#!/bin/bash 

# async-parent: Asynchronous execution demo (parent) 
echo "Parent: starting..." 

echo "Parent: launching child script..." 

async-child & 

pid=$! 

echo "Parent: child (PID= $pid) launched." 


echo "Parent: continuing..." 
sleep 2 


echo "Parent: pausing to wait for child to finish..." 
wait "$pid" 


echo "Parent: child is finished. Continuing..." 
echo "Parent: parent is done. Exiting." 


The second is the child script.


#!/bin/bash 
# async-child: Asynchronous execution demo (child) 
echo "Child: child is running..." 


sleep 5 
echo "Child: child is done. Exiting." 


In this example, we see that the child script is simple. The real action 
is being performed by the parent. In the parent script, the child script is 
launched and put into the background. The process ID of the child script
is recorded by assigning the pid variable with the value of the $! shell
parameter, which will always contain the process ID of the last job put 
into the background. 

The parent script continues and then executes a wait command with the 
PID of the child process. This causes the parent script to pause until the child 
script exits, at which point the parent script concludes. 

When executed, the parent and child scripts produce the following 
output: 


[me@linuxbox ~]$ async-parent
Parent: starting...
Parent: launching child script...
Parent: child (PID= 6741) launched.
Parent: continuing...
Child: child is running...
Parent: pausing to wait for child to finish...
Child: child is done. Exiting.
Parent: child is finished. Continuing...
Parent: parent is done. Exiting.
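The wait command also returns the exit status of the process it waited for, which lets a parent find out whether each child succeeded. The sketch below launches two background jobs and collects their statuses. The jobs are simple stand-ins for real child scripts, and the exit status of 3 is arbitrary.

```bash
#!/bin/bash
# Sketch: collect the exit status of each background job with wait.

sleep 1 &                      # stand-in for a child that succeeds
pid_ok=$!

( sleep 1; exit 3 ) &          # stand-in for a child that fails
pid_fail=$!

status_ok=0
wait "$pid_ok" || status_ok=$?

status_fail=0
wait "$pid_fail" || status_fail=$?

echo "First child exited with status $status_ok"
echo "Second child exited with status $status_fail"
```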


Named Pipes 


In most Unix-like systems, it is possible to create a special type of file called 
a named pipe. Named pipes are used to create a connection between two 
processes and can be used just like other types of files. They are not that 
popular, but they’re good to know about. 

There is a common programming architecture called client-server, which 
can make use of a communication method such as named pipes, as well as 
other kinds of interprocess communication such as network connections. 

The most widely used type of client-server system is, of course, a web 
browser communicating with a web server. The web browser acts as the client, 
making requests to the server, and the server responds to the browser with 
web pages. 

Named pipes behave like files but actually form first-in first-out (FIFO) 
buffers. As with ordinary (unnamed) pipes, data goes in one end and emerges 
out the other. With named pipes, it is possible to set up something like this: 


process1 > named_pipe 


and this: 


process2 < named_pipe 


and it will behave like this: 


process1 | process2


Setting Up a Named Pipe


First, we must create a named pipe. This is done using the mkfifo command. 


[me@linuxbox ~]$ mkfifo pipe1
[me@linuxbox ~]$ ls -l pipe1
prw-r--r-- 1 me me 0 2018-07-17 06:41 pipe1


Here we use mkfifo to create a named pipe called pipe1. Using ls, we
examine the file and see that the first letter in the attributes field is p,
indicating that it is a named pipe.


Using Named Pipes 


To demonstrate how the named pipe works, we will need two terminal win- 
dows (or alternately, two virtual consoles). In the first terminal, we enter a 
simple command and redirect its output to the named pipe. 


[me@linuxbox ~]$ ls -l > pipe1


After we press ENTER, the command will appear to hang. This is because 
there is nothing receiving data from the other end of the pipe yet. When 
this occurs, it is said that the pipe is blocked. This condition will clear once 
we attach a process to the other end and it begins to read input from the 
pipe. Using the second terminal window, we enter this command: 


[me@linuxbox ~]$ cat < pipe1


The directory listing produced from the first terminal window appears in
the second terminal as the output from the cat command. The ls command
in the first terminal successfully completes once it is no longer blocked.
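The same exchange can be carried out inside a single script by placing the writer in the background. This is a sketch only: it creates the pipe in a private directory made with mktemp -d (an assumption of this example, chosen to avoid name clashes), and the message text is arbitrary.

```bash
#!/bin/bash
# Sketch: a writer and a reader sharing a named pipe within one script.

fifo_dir=$(mktemp -d) || exit 1
fifo="$fifo_dir/pipe1"
mkfifo "$fifo"

# The writer blocks until a reader opens the pipe, so it runs in the background
printf '%s\n' "hello through the pipe" > "$fifo" &

read -r message < "$fifo"      # the reader unblocks the writer
wait                           # let the background writer finish

rm -r "$fifo_dir"
echo "Received: $message"
```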


Summing Up


Well, we have completed our journey. The only thing left to do now is
practice, practice, practice. Even though we covered a lot of ground in
our trek, we barely scratched the surface as far as the command line goes.
There are still thousands of command line programs left to be discovered
and enjoyed. Start digging around in /usr/bin and you'll see!